lm_head Gradient Norm

Cross-source consensus on lm_head Gradient Norm, drawn from 1 source and 4 claims.

How it works

Among evaluated signals, only the lm_head gradient norm produced a sharp spike synchronized with collapse onset. — When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
The lm_head gradient norm lower-bounds empirical Pearson chi-squared divergence at the batch level. — When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
The lm_head gradient receives the raw error signal directly, while intermediate layer gradients are filtered through a Jacobian. — When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
A surge in lm_head gradient norm is interpreted as a certified early-warning indicator rather than a heuristic proxy. — When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
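
The claims above can be illustrated with a minimal sketch. It is not the source paper's implementation. It assumes a plain softmax cross-entropy loss as a stand-in for the RLVR objective, and the helper names (`lm_head_grad_norm`, `detect_spike`) and the `window` and `ratio` defaults are hypothetical choices made for illustration. The first helper shows why the lm_head gradient carries the raw error signal: with logits = W · hidden, the gradient with respect to W is the outer product of the output error and the hidden state, so its Frobenius norm factors into the two vector norms with no intermediate Jacobian involved. The second helper flags a sharp spike in a logged series of such norms by comparing each value with the median of the preceding steps.

```python
import math
from statistics import median


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lm_head_grad_norm(hidden, logits, target):
    """Frobenius norm of dL/dW for one token under cross-entropy (assumed loss).

    With logits = W @ hidden, dL/dW = outer(p - onehot(target), hidden),
    so the Frobenius norm equals ||p - onehot(target)|| * ||hidden||.
    """
    probs = softmax(logits)
    # Raw error signal at the output layer: predicted minus one-hot target.
    error = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    err_norm = math.sqrt(sum(e * e for e in error))
    hid_norm = math.sqrt(sum(h * h for h in hidden))
    return err_norm * hid_norm


def detect_spike(norms, window=5, ratio=3.0):
    """Return the first index whose value exceeds `ratio` times the median
    of the preceding `window` values, or None if no index qualifies.

    `window` and `ratio` are illustrative defaults, not values from the source.
    """
    for t in range(window, len(norms)):
        baseline = median(norms[t - window:t])
        if baseline > 0 and norms[t] > ratio * baseline:
            return t
    return None
```

With the default settings, `detect_spike([1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 6.0, 7.5])` returns `6`, the first step whose norm is more than three times the recent median. A median baseline is used here so that a single earlier outlier does not inflate the reference level; whether the source gates on a ratio, an absolute threshold, or something else is not stated in the claims above.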