Gradient Computation Cost

Cross-source consensus on Gradient Computation Cost from 1 source and 4 claims.

1 source · 4 claims

How it works

In simulator-based GRPO training of a 7B VLA model, gradient computation accounts for approximately 78% of wall-clock time per training step while rollout collection accounts for only approximately 21%. — Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
Prior efficiency work implicitly assumed rollout collection was the dominant cost, but this paper challenges that assumption with direct measurement. — Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
The finding that gradient computation dominates training cost inverts the assumption behind most prior VLA RL efficiency work and motivates treating gradient allocation as an explicit design axis. — Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
An earlier approach of branching at decision-critical timesteps was abandoned because it added exponential rollout overhead rather than reducing gradient computation. — Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
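
The reported time split also bounds how much any single optimization can help. The following is a minimal sketch of that arithmetic using an Amdahl-style speedup formula. It assumes the two phases run sequentially without overlap and that speeding up one phase leaves the other unchanged. The function name `step_speedup` and the 2x factors are hypothetical illustrations, not results from the paper; only the 78% and 21% fractions come from the claims above.

```python
import math

# Approximate per-step wall-clock fractions reported in the claims above.
GRADIENT_FRACTION = 0.78
ROLLOUT_FRACTION = 0.21


def step_speedup(fraction: float, factor: float) -> float:
    """Overall step speedup when a phase taking `fraction` of the step
    is made `factor` times faster (Amdahl-style bound).

    Assumes phases are sequential and independent (an illustrative
    simplification, not something the source states).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Hypothetical 2x improvement applied to each phase in isolation.
rollout_2x = step_speedup(ROLLOUT_FRACTION, 2.0)
gradient_2x = step_speedup(GRADIENT_FRACTION, 2.0)

# Upper bounds: the phase cost is driven to zero (factor -> infinity).
rollout_limit = step_speedup(ROLLOUT_FRACTION, math.inf)
gradient_limit = step_speedup(GRADIENT_FRACTION, math.inf)

print(f"2x faster rollouts:  {rollout_2x:.2f}x per step")
print(f"2x faster gradients: {gradient_2x:.2f}x per step")
print(f"free rollouts:       {rollout_limit:.2f}x per step (upper bound)")
print(f"free gradients:      {gradient_limit:.2f}x per step (upper bound)")
```

Under these assumptions, halving rollout time yields roughly a 1.12x faster step and even free rollouts cap out near 1.27x, while halving gradient time yields roughly 1.64x with a ceiling near 4.5x. This is consistent with the claim that gradient allocation is the more productive axis to optimize.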