Importance-Corrected GRPO
Cross-source consensus on Importance-Corrected GRPO: 1 source, 5 claims.
Highlighted claims
- Aborted rollouts have advantages, returns, and response masks zeroed. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- DUET applies importance-corrected, gradient-masked GRPO during the update phase. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- The unbiasedness result applies formally under an action-independent baseline, while the practical GRPO implementation has residual coupling from group-normalized advantages. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- The strongest unbiasedness guarantees do not fully cover the implementation’s group-normalized advantages and token-mean aggregation. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- Empirically, the pessimistic importance-sampling surcharge stayed near 1x and below 1.5x in reported runs. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
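The mechanics highlighted above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DUET's actual implementation: the function name, the array layout, and the clip constant are hypothetical. It shows the pieces the claims name together: aborted rollouts with advantages, returns, and response masks zeroed; group-normalized advantages (the source of the residual coupling noted above); per-token importance ratios clipped from above as a pessimistic correction; and token-mean aggregation.

```python
import numpy as np

def grpo_update_terms(rewards, logp_new, logp_old, response_mask, aborted,
                      ratio_cap=1.5):
    """Hypothetical sketch of an importance-corrected, gradient-masked
    GRPO objective for one group of rollouts.

    rewards: (G,) scalar reward per rollout
    logp_new, logp_old: (G, T) per-token log-probs under the current
        and behavior policies
    response_mask: (G, T) 1.0 on response tokens, 0.0 elsewhere
    aborted: (G,) True for rollouts cut off before completion
    ratio_cap: illustrative upper clip on the importance ratio
        (the source only reports the empirical surcharge staying
        near 1x and below 1.5x; this constant is an assumption)
    """
    rewards = np.asarray(rewards, dtype=float)
    aborted = np.asarray(aborted, dtype=bool)
    mask = np.asarray(response_mask, dtype=float)

    # Aborted rollouts: zero returns and zero response masks, so they
    # contribute nothing to the gradient.
    rewards = np.where(aborted, 0.0, rewards)
    mask = mask * (~aborted)[:, None]

    # Group-normalized advantages over the surviving rollouts only;
    # aborted rollouts get an advantage of exactly zero.
    live = ~aborted
    mu = rewards[live].mean()
    sigma = rewards[live].std() + 1e-8
    adv = np.where(live, (rewards - mu) / sigma, 0.0)

    # Per-token importance ratios, pessimistically clipped from above.
    ratio = np.minimum(np.exp(np.asarray(logp_new) - np.asarray(logp_old)),
                       ratio_cap)

    # Gradient mask x importance weight x advantage, aggregated as a
    # token mean over unmasked response tokens.
    per_token = ratio * adv[:, None] * mask
    loss = -per_token.sum() / np.maximum(mask.sum(), 1.0)
    return loss, adv
```

Note the design consequence the claims point at: because `mu` and `sigma` are computed across the group, each rollout's advantage depends on its siblings' returns, which is why the formal unbiasedness result (stated for an action-independent baseline) does not fully cover this group-normalized, token-mean form.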