Linear Attention
Cross-source consensus on Linear Attention from 1 source and 5 claims.
Highlighted claims
- Speedup from replacing softmax attention with linear recurrence grows with video length. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- Headwise gating with 18K parameters per layer achieves the best quality-efficiency tradeoff among gating granularities tested. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- Linear attention can be reinterpreted as a form of associative memory, with modern variants introducing gating and error-correction for stable long-context updates. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- Elementwise gating performs worse than headwise gating because its 130× larger parameterization converges poorly with limited training steps. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- The learned headwise gate values remain near 0.5, confirming that both the softmax intra-frame and linear inter-frame branches remain active as a content-dependent mixture. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
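The associative-memory reading and the headwise gate described in the claims above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the source paper's implementation: the `elu(x) + 1` feature map, the choice of a linear projection that produces one gate logit per head, the direction of the mixture, and all function and variable names (`feature_map`, `headwise_gate`, `mix_branches`, `W_g`) are hypothetical. The recurrent form keeps a fixed-size state, so per-step cost does not grow with sequence length, which is one plausible reading of why the speedup over softmax attention grows with video length.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a commonly used positive feature map (an assumption here).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(q, k, v, eps=1e-6):
    """Causal linear attention with a running state.

    q, k: (T, d_k); v: (T, d_v). The state S acts as an associative memory
    that accumulates outer products of keys and values.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))  # memory: sum over t of phi(k_t) v_t^T
    z = np.zeros(d_k)         # normalizer: sum over t of phi(k_t)
    out = np.zeros((T, d_v))
    for t in range(T):
        phi_k = feature_map(k[t])
        S += np.outer(phi_k, v[t])   # write to memory
        z += phi_k
        phi_q = feature_map(q[t])
        out[t] = (phi_q @ S) / (phi_q @ z + eps)  # read from memory
    return out

def linear_attention_quadratic(q, k, v, eps=1e-6):
    """Reference form of the same computation using an explicit T x T matrix."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = np.tril(phi_q @ phi_k.T)  # causal mask
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)

def headwise_gate(x, W_g):
    """One gate value per token and head (hypothetical parameterization).

    x: (T, d_model); W_g: (d_model, H). Returns sigmoid gates of shape (T, H).
    """
    return 1.0 / (1.0 + np.exp(-(x @ W_g)))

def mix_branches(local_out, memory_out, gate):
    """Blend a softmax (local) branch with a linear (memory) branch.

    local_out, memory_out: (T, H, d_head); gate: (T, H).
    Which branch the gate weights is an assumption of this sketch.
    """
    g = gate[..., None]  # broadcast the per-head scalar over d_head
    return g * local_out + (1.0 - g) * memory_out
```

In this sketch the headwise gate has `d_model * H` weights, whereas an elementwise gate projecting to every channel would have `d_model * d_model`, so the elementwise variant is larger by a factor equal to the head dimension. A gate value near 0.5 weights both branches about equally.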