Importance-Corrected GRPO
Cross-source consensus on Importance-Corrected GRPO: 1 source, 5 claims.
Highlighted claims
- Aborted rollouts have advantages, returns, and response masks zeroed. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- DUET applies importance-corrected, gradient-masked GRPO during the update phase. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- The unbiasedness result applies formally under an action-independent baseline, while the practical GRPO implementation has residual coupling from group-normalized advantages. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- The strongest unbiasedness guarantees do not fully cover the implementation’s group-normalized advantages and token-mean aggregation. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- Empirically, the pessimistic importance-sampling surcharge stayed near 1x and below 1.5x in reported runs. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
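The mechanics highlighted above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DUET's actual implementation: the function name, the array layout, and the clip constant are hypothetical. It shows the pieces the claims name together: aborted rollouts with advantages, returns, and response masks zeroed; group-normalized advantages (the source of the residual coupling noted above); per-token importance ratios clipped from above as a pessimistic correction; and token-mean aggregation.

```python
import numpy as np

def grpo_update_terms(rewards, logp_new, logp_old, response_mask, aborted,
                      ratio_cap=1.5):
    """Hypothetical sketch of an importance-corrected, gradient-masked
    GRPO objective for one group of rollouts.

    rewards: (G,) scalar reward per rollout
    logp_new, logp_old: (G, T) per-token log-probs under the current
        and behavior policies
    response_mask: (G, T) 1.0 on response tokens, 0.0 elsewhere
    aborted: (G,) True for rollouts cut off before completion
    ratio_cap: illustrative upper clip on the importance ratio
        (the source only reports the empirical surcharge staying
        near 1x and below 1.5x; this constant is an assumption)
    """
    rewards = np.asarray(rewards, dtype=float)
    aborted = np.asarray(aborted, dtype=bool)
    mask = np.asarray(response_mask, dtype=float)

    # Aborted rollouts: zero returns and zero response masks, so they
    # contribute nothing to the gradient.
    rewards = np.where(aborted, 0.0, rewards)
    mask = mask * (~aborted)[:, None]

    # Group-normalized advantages over the surviving rollouts only;
    # aborted rollouts get an advantage of exactly zero.
    live = ~aborted
    mu = rewards[live].mean()
    sigma = rewards[live].std() + 1e-8
    adv = np.where(live, (rewards - mu) / sigma, 0.0)

    # Per-token importance ratios, pessimistically clipped from above.
    ratio = np.minimum(np.exp(np.asarray(logp_new) - np.asarray(logp_old)),
                       ratio_cap)

    # Gradient mask x importance weight x advantage, aggregated as a
    # token mean over unmasked response tokens.
    per_token = ratio * adv[:, None] * mask
    loss = -per_token.sum() / np.maximum(mask.sum(), 1.0)
    return loss, adv
```

Note the design consequence the claims point at: because `mu` and `sigma` are computed across the group, each rollout's advantage depends on its siblings' returns, which is why the formal unbiasedness result (stated for an action-independent baseline) does not fully cover this group-normalized, token-mean form.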