Online Data Selection
Cross-source consensus on Online Data Selection from 1 source and 6 claims.
Highlighted claims
- Offline filtering methods are inherently static and become off-policy as the model's capabilities evolve and the effective difficulty of each prompt shifts. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- Reinforce-Rej applies binary keep-or-drop decisions when prompts become all-correct or all-incorrect, discarding the continuous gradient signal and potentially destabilizing the training distribution. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- The DPS method predicts prompt solvability via a hidden Markov model but requires maintaining a learned generative state model with non-trivial computational overhead. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- Uniform sampling at the same 40% retention rate consistently underperforms the full-data baseline, confirming that LZE's gains come from principled selection rather than reduced data volume. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- RAFT underperforms even the full RL baseline because retaining only positively rewarded rollouts for supervised fine-tuning sacrifices the policy-gradient-driven exploration essential for generalization. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- Performance peaks at a selection ratio κ=0.4; lower ratios starve the policy gradient, while higher ratios allow all-correct and all-incorrect prompts to dilute the gradient signal. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
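The contrast between score-based selection and uniform sampling at the same retention rate can be illustrated with a minimal sketch. The scoring function below is an assumption, not the LZE energy from the cited paper: it uses the Bernoulli variance p(1 - p) of each prompt's empirical pass rate as a stand-in, which is zero for all-correct and all-incorrect prompts. The names `stand_in_score`, `select_top_kappa`, and `select_uniform` are hypothetical.

```python
import random


def stand_in_score(rewards):
    """Hypothetical stand-in score (NOT the paper's LZE energy).

    Uses the variance p * (1 - p) of binary rollout rewards, which is 0
    when a prompt is all-correct or all-incorrect.
    """
    p = sum(rewards) / len(rewards)
    return p * (1.0 - p)


def select_top_kappa(rollout_rewards, kappa=0.4):
    """Keep the top-kappa fraction of prompts ranked by the stand-in score."""
    n_keep = max(1, round(kappa * len(rollout_rewards)))
    ranked = sorted(
        rollout_rewards,
        key=lambda pid: stand_in_score(rollout_rewards[pid]),
        reverse=True,
    )
    return ranked[:n_keep]


def select_uniform(rollout_rewards, kappa=0.4, seed=0):
    """Baseline: keep the same fraction of prompts, chosen uniformly at random."""
    n_keep = max(1, round(kappa * len(rollout_rewards)))
    rng = random.Random(seed)
    return rng.sample(sorted(rollout_rewards), n_keep)


# Toy batch: binary rewards from 4 rollouts per prompt (illustrative data only).
batch = {
    "a": [1, 1, 1, 1],  # all-correct
    "b": [0, 0, 0, 0],  # all-incorrect
    "c": [1, 0, 1, 0],  # mixed outcomes
    "d": [1, 0, 0, 0],  # mostly incorrect
    "e": [1, 1, 1, 1],  # all-correct
}

selected = select_top_kappa(batch, kappa=0.4)
baseline = select_uniform(batch, kappa=0.4)
```

With κ = 0.4 and five prompts, both selectors keep two prompts. The score-based selector keeps only prompts with mixed outcomes, while the uniform baseline may keep all-correct or all-incorrect prompts that contribute no signal under this stand-in score.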