Reinforcement Learning with Verifiable Rewards
Cross-source consensus on Reinforcement Learning with Verifiable Rewards, drawn from 1 source and 5 claims.
Highlighted claims
- Reinforcement learning with verifiable rewards (RLVR) is used to improve mathematical and long-form reasoning in large language models because generated answers can be checked automatically. — Gradient Extrapolation-Based Policy Optimization
- The experiments trained GRPO-family reasoning RL on Qwen2.5 and Llama3.2 instruction models using Hendrycks MATH Level 3-5 training data. — Gradient Extrapolation-Based Policy Optimization
- GXPO can be added to GRPO-style RLVR pipelines with minimal changes to the data path. — Gradient Extrapolation-Based Policy Optimization
- The paper targets a cost-quality tension in reasoning RL caused by expensive extra backward passes and potentially weaker single-step updates. — Gradient Extrapolation-Based Policy Optimization
- More efficient RLVR may accelerate stronger reasoning models, so GXPO-trained models should still undergo standard safety, misuse, and reliability evaluation before deployment. — Gradient Extrapolation-Based Policy Optimization
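The first claim above rests on answers being checkable automatically. The paper's actual verifier is not described in these claims; a minimal sketch of what such a verifiable-reward function could look like for math-style answers (the function name and normalization rules are illustrative assumptions, not the paper's implementation):

```python
from fractions import Fraction

def verify_answer(predicted: str, reference: str) -> float:
    """Hypothetical binary verifiable reward: 1.0 if the predicted final
    answer matches the reference, else 0.0. Tries exact numeric equality
    first (so "1/2" matches "0.5"), then falls back to a normalized
    string comparison for symbolic answers."""
    def normalize(s: str) -> str:
        # Illustrative normalization: case, whitespace, trailing period.
        return s.strip().lower().replace(" ", "").rstrip(".")

    try:
        # Fraction parses integers, decimals, and ratios exactly.
        if Fraction(predicted.strip()) == Fraction(reference.strip()):
            return 1.0
    except (ValueError, ZeroDivisionError):
        pass
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

A reward like this is what lets RLVR skip a learned reward model for math tasks: the check is exact, cheap, and hard to game compared with preference scores.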
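For context on the GRPO-style pipelines that GXPO slots into: GRPO scores each sampled completion relative to the other completions for the same prompt, rather than using a learned value function. A minimal sketch of that group-relative advantage computation, assuming scalar (e.g. binary verifiable) rewards; GXPO's own update is not specified in these claims:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as used in GRPO-style RLVR: for a group
    of completions sampled from the same prompt, center each reward on
    the group mean and scale by the group standard deviation (with a
    small epsilon for numerical stability when all rewards are equal)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    eps = 1e-8
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline comes from the group itself, the data path needs only the sampled completions and their rewards, which is why a drop-in method can claim "minimal changes to the data path."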