Policy Gradient Theory

Cross-source consensus on Policy Gradient Theory from 1 source and 5 claims.

How it works

In ablation studies, removing the uncertainty term 4p(1-p), where p is a prompt's pass rate, causes the largest performance degradation of any single component, consistent with the theoretical result that motivates the term. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Removing the difficulty anchor causes the selector to collapse toward initially easy prompts that transiently sit near p=0.5 due to policy noise. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
The Bernoulli variance p(1-p) governs per-prompt gradient informativeness, directly justifying the uncertainty term as the primary driver of the Energy Score. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
The momentum term m_i(t) is the output of a complementary causal high-pass filter on the pass-rate sequence, isolating rapid temporal changes in policy performance. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Theorem 1's guarantees depend on a homogeneity approximation for score-function norms across reward outcomes, and this approximation may not hold in all settings. — Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
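
The uncertainty and momentum claims above can be illustrated with a minimal Python sketch. It is not the source paper's implementation: the function names, the exponential-moving-average low-pass, and the smoothing factor alpha are assumptions introduced here for illustration. The sketch relies on two facts only: a Bernoulli outcome with success probability p has variance p(1-p), which peaks at 1/4 when p=0.5 (so 4p(1-p) lies in [0, 1]), and a complementary high-pass output equals the input minus a low-pass of the same input.

```python
from typing import List, Sequence


def uncertainty(p: float) -> float:
    """Normalized Bernoulli variance 4p(1-p): 0 at p=0 or p=1, 1 at p=0.5."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return 4.0 * p * (1.0 - p)


def momentum(pass_rates: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Causal complementary high-pass of a pass-rate sequence.

    ASSUMPTION: the low-pass is an exponential moving average with
    smoothing factor alpha; the source does not specify the filter.
    Each output uses only current and past samples, so it is causal.
    """
    out: List[float] = []
    low = None  # running low-pass state
    for x in pass_rates:
        # Initialize the low-pass at the first sample, then update the EMA.
        low = x if low is None else alpha * x + (1.0 - alpha) * low
        # Complementary high-pass: input minus its low-pass component.
        out.append(x - low)
    return out


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 1.0):
        print(p, uncertainty(p))
    # A flat history gives (near) zero momentum; a jump gives a positive spike.
    print(momentum([0.0, 0.0, 1.0, 1.0], alpha=0.5))
```

Under these assumptions, a prompt that is always solved or never solved scores 0 on the uncertainty term, while a sudden rise in pass rate shows up as a positive momentum value that decays as the low-pass catches up.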