Group Relative Policy Optimization
Cross-source consensus on Group Relative Policy Optimization, drawn from 1 source and 5 claims.
Highlighted claims
- GRPO required one backward pass per training step in the reported comparison. — Gradient Extrapolation-Based Policy Optimization
- GXPO falls back to single-pass GRPO if the trajectory-aware shutoff rule disables extrapolation. — Gradient Extrapolation-Based Policy Optimization
- GXPO reached GRPO's peak-accuracy threshold faster than GRPO on Llama3.2-3B, measured in training steps, wall-clock time, and backward passes. — Gradient Extrapolation-Based Policy Optimization
- Standard GRPO-style training updates the current policy using the gradient evaluated at that policy. — Gradient Extrapolation-Based Policy Optimization
- Single-step GRPO-style updates are cheap but may miss gradient-trajectory information that could improve update directions. — Gradient Extrapolation-Based Policy Optimization
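The claims above can be sketched in code. The snippet below is a minimal, illustrative sketch, not the paper's implementation: it shows a single-pass GRPO-style update (group-relative advantages, gradient evaluated once at the current policy) plus a generic linear gradient extrapolation standing in for the trajectory-aware idea. The function names, the extrapolation coefficient `beta`, and the fallback flag are assumptions; the source does not specify GXPO's actual extrapolation or shutoff rule.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each sampled completion's
    reward against the group mean and std, (r_i - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_step(theta, grad_log_probs, rewards, lr=0.1):
    """One single-pass GRPO-style update: one backward pass per step,
    with the gradient evaluated at the current policy only.

    grad_log_probs: array of shape (group_size, dim), the per-sample
    gradients of log pi(completion | prompt) at the current theta.
    """
    adv = group_relative_advantages(rewards)
    # Advantage-weighted policy gradient, averaged over the group.
    g = (adv[:, None] * grad_log_probs).mean(axis=0)
    return theta + lr * g, g

def extrapolated_step(theta, g_curr, g_prev, lr=0.1, beta=0.5,
                      extrapolate=True):
    """Illustrative gradient extrapolation (an assumed linear rule,
    not GXPO's actual method): push the update along the recent change
    in gradient direction. With extrapolate=False it falls back to the
    plain single-pass update, mirroring the fallback claim above."""
    if extrapolate and g_prev is not None:
        direction = g_curr + beta * (g_curr - g_prev)
    else:
        direction = g_curr  # plain GRPO-style update
    return theta + lr * direction
```

With `extrapolate=False` the extrapolated step reduces exactly to the single-pass update, which is the behavior the fallback claim describes: no extra backward passes are spent when extrapolation is disabled.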