Gradient Extrapolation-Based Policy Optimization
Cross-source consensus on Gradient Extrapolation-Based Policy Optimization from 1 source and 6 claims.
Highlighted claims
- GXPO is introduced as a change to the policy update rule rather than a replacement for the broader reinforcement learning pipeline. — Gradient Extrapolation-Based Policy Optimization
- GXPO keeps the same rollout batch, rewards, advantages, KL regularization, and GRPO loss while changing only the parameter update. — Gradient Extrapolation-Based Policy Optimization
- GXPO replaces a single GRPO update with a three-backward-pass active-phase update (a code sketch follows this list). — Gradient Extrapolation-Based Policy Optimization
- GXPO computes a real corrective gradient after repositioning partway toward the extrapolated point. — Gradient Extrapolation-Based Policy Optimization
- GXPO estimates a virtual K-step policy point by scaling an observed two-step displacement with a geometric-sum ratio (written out after this list). — Gradient Extrapolation-Based Policy Optimization
- GXPO was configured with alpha_0 of 0.5, delta of 1e-8, tau of 0.5, and trajectory-aware shutoff in the Qwen2.5-7B setup. — Gradient Extrapolation-Based Policy Optimization
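The geometric-sum scaling referenced above can be written out. The reconstruction below follows the claims' wording rather than any formula quoted from the source, so the symbols $\theta_t$, $d_t$, $r$, and $K$ are assumptions. Writing $d_1 = \theta_1 - \theta_0$ and $d_2 = \theta_2 - \theta_1$ for the two observed displacements, and estimating a per-step contraction ratio $r \approx \lVert d_2 \rVert / (\lVert d_1 \rVert + \delta)$ with $\delta = 10^{-8}$ guarding the division, a virtual $K$-step point is

$$\theta_K^{\mathrm{virt}} \approx \theta_0 + \frac{1 - r^{K}}{(1 - r)(1 + r)}\,\bigl(\theta_2 - \theta_0\bigr),$$

since the observed two-step displacement is $\theta_2 - \theta_0 = d_1 (1 + r)$, while a full $K$-step geometric sum from $\theta_0$ would be $d_1 (1 - r^{K})/(1 - r)$.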
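For concreteness, here is a minimal PyTorch sketch of the three-backward-pass active-phase update as the claims describe it. This is a hedged reading, not the reference implementation: the function names (`gxpo_update`, `step_sgd`, `load_flat`), the plain-SGD inner step, the cosine test standing in for the trajectory-aware shutoff, the virtual horizon `K`, and the exact roles of `alpha0` and `tau` are all assumptions layered on the claims.

```python
import torch


def flat_params(model):
    """Current parameters as one detached flat vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])


def load_flat(model, flat):
    """Write a flat vector back into the model's parameters."""
    with torch.no_grad():
        offset = 0
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[offset:offset + n].view_as(p))
            offset += n


def step_sgd(model, loss_fn, batch, lr):
    """One backward pass on the (unchanged) GRPO loss, then a plain SGD step."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad


def gxpo_update(model, grpo_loss_fn, batch, lr,
                K=8, alpha0=0.5, delta=1e-8, tau=0.5):
    """One active-phase GXPO-style update on a fixed rollout batch.

    grpo_loss_fn(model, batch) must return the scalar GRPO loss; rewards,
    advantages, and KL regularization all live inside it, untouched here.
    K (the virtual horizon) is a hypothetical hyperparameter: the claims
    do not state its value.
    """
    theta0 = flat_params(model)

    step_sgd(model, grpo_loss_fn, batch, lr)   # backward pass 1
    theta1 = flat_params(model)

    step_sgd(model, grpo_loss_fn, batch, lr)   # backward pass 2
    theta2 = flat_params(model)

    d1, d2 = theta1 - theta0, theta2 - theta1

    # Trajectory-aware shutoff (assumed form): if the two observed steps
    # point in conflicting directions, extrapolating along them is unsafe,
    # so keep the plain two-step result.
    cos = torch.dot(d1, d2) / (d1.norm() * d2.norm() + delta)
    if cos <= 0:
        return

    # Per-step contraction ratio estimated from the two displacements;
    # delta keeps the divisions finite, as in the formula above.
    r = min(float(d2.norm() / (d1.norm() + delta)), 1.0 - delta)

    # Geometric-sum scaling of the observed two-step displacement.
    scale = (1.0 - r ** K) / ((1.0 - r) * (1.0 + r))
    theta_ext = theta0 + scale * (theta2 - theta0)

    # alpha0's exact role is not pinned down by the claims; here it is
    # assumed to damp the extrapolation beyond the observed point.
    theta_virtual = theta2 + alpha0 * (theta_ext - theta2)

    # Reposition a fraction tau of the way toward the virtual point, then
    # take a real corrective gradient there (backward pass 3).
    load_flat(model, theta2 + tau * (theta_virtual - theta2))
    step_sgd(model, grpo_loss_fn, batch, lr)
```

Loading flat vectors back into the model keeps the three passes explicit for readability; a practical implementation would more likely update optimizer state in place rather than round-trip through flattened parameters.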