Policy Gradient Theory
Cross-source consensus on Policy Gradient Theory from 1 source and 5 claims.
Highlighted claims
- Removing the uncertainty term 4p(1-p) causes the largest single-component performance degradation in ablation studies, validating the theoretical result. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- Removing the difficulty anchor causes the selector to collapse toward initially easy prompts that transiently sit near p=0.5 due to policy noise. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- The Bernoulli variance p(1-p) governs per-prompt gradient informativeness, directly justifying the uncertainty term as the primary driver of the Energy Score. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- The momentum term m_i(t) is the output of a complementary causal high-pass filter on the pass-rate sequence, isolating rapid temporal changes in policy performance. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
- Theorem 1's theoretical guarantees depend on a homogeneity approximation for score-function norms across reward outcomes, which may not hold in all settings. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
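
The uncertainty and momentum terms named in the claims above can be illustrated with a minimal sketch. Only the expression 4p(1-p) and the description of momentum as a complementary causal high-pass filter come from the claims. The function names, the choice of a first-order exponential moving average as the low-pass filter, and the smoothing factor alpha are assumptions made for illustration. How the source combines these terms with the difficulty anchor into the Energy Score is not stated here, so it is not shown.

```python
from typing import List, Sequence


def uncertainty(p: float) -> float:
    """Scaled Bernoulli variance 4p(1-p).

    It is 0 when a prompt is always failed (p=0) or always solved (p=1)
    and reaches its maximum of 1 at p=0.5.
    """
    return 4.0 * p * (1.0 - p)


def momentum(pass_rates: Sequence[float], alpha: float = 0.3) -> List[float]:
    """High-pass output complementary to a causal low-pass filter.

    Assumption: the low-pass filter is a first-order exponential moving
    average (EMA) with smoothing factor alpha. "Complementary" means that
    low-pass output + high-pass output equals the input at every step,
    so the high-pass output is simply input minus EMA.
    """
    if not pass_rates:
        return []
    ema = pass_rates[0]  # initialise the EMA at the first observation
    out = []
    for p in pass_rates:
        ema = alpha * p + (1.0 - alpha) * ema  # causal: uses past and present only
        out.append(p - ema)                    # what the low-pass filter removed
    return out


if __name__ == "__main__":
    # Hypothetical pass-rate history for one prompt: flat, then a sudden jump.
    history = [0.2, 0.2, 0.8]
    print([round(uncertainty(p), 3) for p in history])
    print([round(m, 3) for m in momentum(history)])
```

In this sketch a prompt whose pass rate is flat yields momentum near zero, while a sudden change yields a large value that decays as the moving average catches up. That matches the claim that the term isolates rapid temporal changes in policy performance.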