Baselines
Cross-source consensus on Baselines from 1 source and 5 claims.
Comparisons
Highlighted claims
- Prior efficient-RLVR methods typically control prompt/rollout selection or within-rollout pruning in isolation, rather than jointly optimizing rollout count and length. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- DAPO post-filters rollouts after they have been generated, so it does not reduce the main generation cost. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- Fixed length caps are treated as suboptimal because they can penalize coherent reasoning that is merely unfinished and discard useful long reasoning chains. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- VIP adapts rollout allocation but treats rollout length as exogenous. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
- ARRoL saved only a small fraction of generated tokens in the math setting because few rollouts reached its inspection point. — DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
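To make the DAPO claim above concrete: in a post-filtering scheme, every rollout is generated first, and uninformative prompt groups (e.g. where all rollouts are correct, or all are incorrect, so the group carries no advantage signal) are discarded only afterwards. The generation compute is therefore paid in full regardless of how many groups survive. A minimal sketch of this idea, assuming 0/1 verifiable rewards; the function and variable names below are illustrative, not DAPO's actual implementation:

```python
def post_filter(groups):
    """Keep only prompt groups whose rollout rewards are mixed.

    `groups` maps a prompt id to a list of 0/1 rewards, one per rollout.
    All rollouts already exist by the time this runs, so filtering here
    saves gradient-update compute but no generation compute.
    """
    kept = {}
    for prompt, rewards in groups.items():
        # Mixed outcomes (some correct, some not) give nonzero advantage.
        if 0 < sum(rewards) < len(rewards):
            kept[prompt] = rewards
    return kept


groups = {
    "p1": [1, 1, 1, 1],  # all correct: zero-variance group, dropped
    "p2": [1, 0, 1, 0],  # mixed outcomes: kept
    "p3": [0, 0, 0, 0],  # all incorrect: dropped
}
generated = sum(len(r) for r in groups.values())  # rollouts already paid for
kept = post_filter(groups)
# All 12 rollouts were generated, but only the 4 in "p2" feed updates.
```

This is exactly the asymmetry the claim points at: the filter decides what enters the update, not what gets generated, which is why joint control of rollout count and length is framed as the missing piece.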