Online Data Selection

Cross-source consensus on Online Data Selection from 1 source and 6 claims.

Highlighted claims

Offline filtering methods are inherently static and become off-policy as the model's capabilities evolve and the effective difficulty of each prompt shifts. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Reinforce-Rej applies binary keep-or-drop decisions when prompts become all-correct or all-incorrect, discarding the continuous gradient signal and potentially destabilizing the training distribution. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
The DPS method predicts prompt solvability via a hidden Markov model but requires maintaining a learned generative state model with non-trivial computational overhead. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Uniform sampling at the same 40% retention rate consistently underperforms the full-data baseline, confirming that LZE's gains come from principled selection rather than reduced data volume. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
RAFT underperforms even the full RL baseline because retaining only positively rewarded rollouts for supervised fine-tuning sacrifices the policy-gradient-driven exploration essential for generalization. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Performance peaks at a selection ratio κ=0.4; lower ratios starve the policy gradient, while higher ratios allow all-correct and all-incorrect prompts to dilute the gradient signal. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
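To make the contrast among these claims concrete, the following is a minimal sketch of three per-step selection rules: a binary keep-or-drop filter in the spirit of the Reinforce-Rej description, a continuous top-κ selection, and uniform sampling at the same retention rate. Everything here is an assumption for illustration. The function names, the toy rollout data, and in particular the `bernoulli_variance` score are hypothetical; the source does not define the LZE score, so the variance of a 0/1 reward is used only as a stand-in for "some continuous measure of how informative a prompt currently is".

```python
import math
import random


def pass_rate(rewards):
    """Fraction of rollouts with reward 1 for a single prompt."""
    return sum(rewards) / len(rewards)


def binary_filter(rollouts):
    """Keep-or-drop rule: discard prompts whose rollouts are all correct
    or all incorrect, keep every other prompt with equal weight."""
    return [pid for pid, r in rollouts.items() if 0.0 < pass_rate(r) < 1.0]


def bernoulli_variance(p):
    """Placeholder score (an assumption, NOT the LZE energy): p * (1 - p)
    is zero for all-correct or all-incorrect prompts and largest at p = 0.5."""
    return p * (1.0 - p)


def top_kappa(rollouts, kappa, score=bernoulli_variance):
    """Keep the ceil(kappa * N) prompts with the highest score."""
    k = math.ceil(kappa * len(rollouts))
    ranked = sorted(
        rollouts,
        key=lambda pid: score(pass_rate(rollouts[pid])),
        reverse=True,
    )
    return ranked[:k]


def uniform_sample(rollouts, kappa, seed=0):
    """Baseline: keep the same number of prompts, chosen at random."""
    k = math.ceil(kappa * len(rollouts))
    return random.Random(seed).sample(sorted(rollouts), k)


# Toy batch: prompt id -> 0/1 rewards of four rollouts (made-up data).
rollouts = {
    "a": [1, 1, 1, 1],  # all correct
    "b": [0, 0, 0, 0],  # all incorrect
    "c": [1, 0, 1, 0],
    "d": [1, 0, 0, 0],
    "e": [1, 1, 1, 0],
}

kept_binary = binary_filter(rollouts)
kept_ranked = top_kappa(rollouts, kappa=0.4)
kept_uniform = uniform_sample(rollouts, kappa=0.4)
```

Because the selection is recomputed from fresh rollouts at each step, it tracks the current policy rather than a fixed offline difficulty estimate. The binary filter treats every surviving prompt identically, whereas the ranked rule orders prompts by a continuous score and lets κ control how many are retained; the uniform baseline keeps the same count without using the score at all.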