Reinforcement Learning Post-Training
Cross-source consensus on Reinforcement Learning Post-Training from 1 source and 5 claims.
Highlighted claims
- Reinforcement learning post-training with verifiable rewards has become the dominant approach for improving mathematical reasoning in large language models. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- GRPO and DAPO generate multiple rollouts per prompt, aggregate rewards within each group, and update the policy, but allocate compute nearly uniformly across all prompts. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- When a prompt's pass rate approaches 1 or 0, every rollout in the group receives the same reward, so the group-relative advantages collapse to zero and the prompt contributes no gradient; the compute spent on prompts at either extreme is wasted. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- All LZE experiments use the Verl framework with GRPO/DAPO training, 8 rollouts per prompt, and KL-divergence penalty removed. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- All experiments use binary verifiable rewards from a rule-based answer-matching checker, and whether the design transfers to continuous or noisy reward settings is an open question. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
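
The advantage collapse described in the claims above can be illustrated with a small sketch. It assumes the commonly used group-relative form, in which each rollout's reward is centered on its group's mean and divided by the group's standard deviation; the source does not spell out its exact normalization, and DAPO-style variants may differ. The function names, the `eps` stabilizer, and the sample reward vectors are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each rollout's reward on the group mean and scale by the group std.

    Assumed formulation for illustration; `eps` is a hypothetical stabilizer
    that avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A group can only produce nonzero advantages if its rewards differ."""
    return len(set(rewards)) > 1


# Eight rollouts per prompt with binary rewards (1 = answer matched, 0 = not).
mixed = [1, 0, 1, 0, 0, 1, 0, 0]   # pass rate 3/8: informative
all_correct = [1] * 8              # pass rate 1: every reward equals the mean
all_wrong = [0] * 8                # pass rate 0: every reward equals the mean

for name, group in [("mixed", mixed), ("all_correct", all_correct), ("all_wrong", all_wrong)]:
    advantages = group_relative_advantages(group)
    print(name, has_learning_signal(group), [round(a, 3) for a in advantages])
```

Under these assumptions, only the mixed group yields nonzero advantages. The two saturated groups still cost eight rollouts each but add nothing to the policy update, which is the inefficiency that online data selection aims to avoid.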