Reinforcement Learning Post-Training

Cross-source consensus on Reinforcement Learning Post-Training from 1 source and 5 claims.

1 source · 5 claims

How it works

Reinforcement learning post-training with verifiable rewards has become the dominant approach for improving mathematical reasoning in large language models. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
GRPO and DAPO generate multiple rollouts per prompt, aggregate rewards within each group, and update the policy, but allocate compute nearly uniformly across all prompts. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
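
The group-wise aggregation described above can be sketched as follows. This is a minimal illustration, not the implementation from the source: it assumes the commonly described GRPO normalization (each rollout's reward minus the group mean, divided by the group standard deviation), and the function name, the `eps` constant, and the choice of population standard deviation are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against the other rollouts for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The eps term only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, 8 rollouts, binary rewards: 3 correct and 5 incorrect.
rewards = [1, 0, 0, 1, 0, 1, 0, 0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r} advantage={a:+.3f}")
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy update pushes toward the responses that beat their own group's average. Note that the same number of rollouts is spent on every prompt regardless of how informative the group turns out to be.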
When a prompt's pass rate approaches 1 or 0, every rollout in the group receives the same reward, so group-relative advantages collapse to zero and the prompt contributes no gradient; the compute spent on prompts at both extremes is wasted. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
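
To get a feel for how often this collapse happens, the sketch below estimates the chance that a whole group carries no learning signal. It rests on a simplifying assumption that is not from the source: the rollouts for a prompt are independent Bernoulli trials with success probability equal to the pass rate. Under that assumption, a group of size G is all-correct or all-incorrect with probability p^G + (1 - p)^G. The function name is hypothetical.

```python
from statistics import mean

def zero_signal_probability(pass_rate, group_size=8):
    """Probability that all rollouts in a group get the same binary reward.

    Assumption: rollouts are independent with success probability pass_rate.
    """
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size

# When every reward in the group is identical, each reward equals the
# group mean, so the mean-centered advantages are all zero.
all_correct = [1] * 8
centered = [r - mean(all_correct) for r in all_correct]

for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"pass_rate={p:.2f} P(no signal)={zero_signal_probability(p):.4f}")
```

With 8 rollouts, a prompt at a pass rate of 0.5 almost never produces a degenerate group, while a prompt at 0.95 does so roughly two times out of three, which is why uniform allocation spends much of its budget on prompts that return no gradient.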
All LZE experiments use the Verl framework with GRPO/DAPO training, 8 rollouts per prompt, and the KL-divergence penalty removed. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
All experiments use binary verifiable rewards from a rule-based answer-matching checker; whether the design transfers to continuous or noisy reward settings remains an open question. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
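
For readers unfamiliar with this reward style, here is a minimal sketch of what a rule-based answer-matching checker could look like. It is purely illustrative and is not the checker used in the source: the `\boxed{...}` answer convention, the normalization rules, and all function names are assumptions.

```python
import re

def extract_boxed(text):
    r"""Return the content of the last \boxed{...} in text, or None.

    Assumption: answers contain no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def normalize(answer):
    """Strip whitespace and a trailing period before comparing (hypothetical rules)."""
    return answer.strip().replace(" ", "").rstrip(".")

def binary_reward(response, reference):
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0

print(binary_reward("So the answer is \\boxed{42}.", "42"))
print(binary_reward("I think it is \\boxed{41}.", "42"))
```

Because the output is strictly 0 or 1, a group's rewards are fully described by its pass rate, which is what makes the all-correct and all-incorrect cases degenerate. A continuous or noisy reward would rarely produce exactly identical values within a group, which is one reason transfer to those settings is not obvious.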