Gradient Extrapolation-Based Policy Optimization
Cross-source consensus on Gradient Extrapolation-Based Policy Optimization from 1 source and 6 claims.
Highlighted claims
- GXPO is introduced as a change to the policy update rule rather than a replacement for the broader reinforcement learning pipeline. — Gradient Extrapolation-Based Policy Optimization
- GXPO keeps the same rollout batch, rewards, advantages, KL regularization, and GRPO loss while changing only the parameter update. — Gradient Extrapolation-Based Policy Optimization
- GXPO replaces a single GRPO update with a three-backward-pass active-phase update (a code sketch follows this list). — Gradient Extrapolation-Based Policy Optimization
- GXPO computes a real corrective gradient after repositioning partway toward the extrapolated point. — Gradient Extrapolation-Based Policy Optimization
- GXPO estimates a virtual K-step policy point by scaling an observed two-step displacement with a geometric-sum ratio (written out after this list). — Gradient Extrapolation-Based Policy Optimization
- GXPO was configured with alpha_0 of 0.5, delta of 1e-8, tau of 0.5, and trajectory-aware shutoff in the Qwen2.5-7B setup. — Gradient Extrapolation-Based Policy Optimization
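The geometric-sum scaling referenced above can be written out. The reconstruction below follows the claims' wording rather than any formula quoted from the source, so the symbols $\theta_t$, $d_t$, $r$, and $K$ are assumptions. Writing $d_1 = \theta_1 - \theta_0$ and $d_2 = \theta_2 - \theta_1$ for the two observed displacements, and estimating a per-step contraction ratio $r \approx \lVert d_2 \rVert / (\lVert d_1 \rVert + \delta)$ with $\delta = 10^{-8}$ guarding the division, a virtual $K$-step point is

$$\theta_K^{\mathrm{virt}} \approx \theta_0 + \frac{1 - r^{K}}{(1 - r)(1 + r)}\,\bigl(\theta_2 - \theta_0\bigr),$$

since the observed two-step displacement is $\theta_2 - \theta_0 = d_1 (1 + r)$, while a full $K$-step geometric sum from $\theta_0$ would be $d_1 (1 - r^{K})/(1 - r)$.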
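For concreteness, here is a minimal PyTorch sketch of the three-backward-pass active-phase update as the claims describe it. This is a hedged reading, not the reference implementation: the function names (`gxpo_update`, `step_sgd`, `load_flat`), the plain-SGD inner step, the cosine test standing in for the trajectory-aware shutoff, the virtual horizon `K`, and the exact roles of `alpha0` and `tau` are all assumptions layered on the claims.

```python
import torch


def flat_params(model):
    """Current parameters as one detached flat vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])


def load_flat(model, flat):
    """Write a flat vector back into the model's parameters."""
    with torch.no_grad():
        offset = 0
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[offset:offset + n].view_as(p))
            offset += n


def step_sgd(model, loss_fn, batch, lr):
    """One backward pass on the (unchanged) GRPO loss, then a plain SGD step."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad


def gxpo_update(model, grpo_loss_fn, batch, lr,
                K=8, alpha0=0.5, delta=1e-8, tau=0.5):
    """One active-phase GXPO-style update on a fixed rollout batch.

    grpo_loss_fn(model, batch) must return the scalar GRPO loss; rewards,
    advantages, and KL regularization all live inside it, untouched here.
    K (the virtual horizon) is a hypothetical hyperparameter: the claims
    do not state its value.
    """
    theta0 = flat_params(model)

    step_sgd(model, grpo_loss_fn, batch, lr)   # backward pass 1
    theta1 = flat_params(model)

    step_sgd(model, grpo_loss_fn, batch, lr)   # backward pass 2
    theta2 = flat_params(model)

    d1, d2 = theta1 - theta0, theta2 - theta1

    # Trajectory-aware shutoff (assumed form): if the two observed steps
    # point in conflicting directions, extrapolating along them is unsafe,
    # so keep the plain two-step result.
    cos = torch.dot(d1, d2) / (d1.norm() * d2.norm() + delta)
    if cos <= 0:
        return

    # Per-step contraction ratio estimated from the two displacements;
    # delta keeps the divisions finite, as in the formula above.
    r = min(float(d2.norm() / (d1.norm() + delta)), 1.0 - delta)

    # Geometric-sum scaling of the observed two-step displacement.
    scale = (1.0 - r ** K) / ((1.0 - r) * (1.0 + r))
    theta_ext = theta0 + scale * (theta2 - theta0)

    # alpha0's exact role is not pinned down by the claims; here it is
    # assumed to damp the extrapolation beyond the observed point.
    theta_virtual = theta2 + alpha0 * (theta_ext - theta2)

    # Reposition a fraction tau of the way toward the virtual point, then
    # take a real corrective gradient there (backward pass 3).
    load_flat(model, theta2 + tau * (theta_virtual - theta2))
    step_sgd(model, grpo_loss_fn, batch, lr)
```

Loading flat vectors back into the model keeps the three passes explicit for readability; a practical implementation would more likely update optimizer state in place rather than round-trip through flattened parameters.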