Linear Attention

Consensus on Linear Attention, drawn from 1 source and 5 claims.

1 source · 5 claims

How it works

Speedup from replacing softmax attention with linear recurrence grows with video length. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
Headwise gating with 18K parameters per layer achieves the best quality-efficiency tradeoff among gating granularities tested. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
Linear attention can be reinterpreted as a form of associative memory, with modern variants introducing gating and error-correction for stable long-context updates. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
Elementwise gating performs worse than headwise gating because its 130× larger parameterization converges poorly with limited training steps. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
The learned headwise gate values remain near 0.5, confirming that both the softmax intra-frame and linear inter-frame branches remain active as a content-dependent mixture. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
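
Illustrative sketch

The claims above describe three pieces: a softmax branch that attends within a frame, a linear-attention branch that acts as an associative memory across frames, and a headwise gate that mixes the two. The NumPy sketch below shows one way these pieces could fit together. It is not the source's implementation. The function names, the ELU + 1 feature map, the sigmoid gate computed from a linear projection of the token features, the choice of which branch receives g versus 1 - g, and all tensor sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feature_map(x):
    # Assumed positive feature map (ELU + 1); the source does not name one.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def hybrid_frame_step(x, q, k, v, w_gate, state, norm):
    """Process one frame (hypothetical helper).

    x:      (n, d_model) token features used to compute the gate
    q/k/v:  (H, n, d) per-head projections for the n tokens of this frame
    w_gate: (d_model, H) headwise gate weights, one logit per head
    state:  (H, d, d) associative memory written by earlier frames
    norm:   (H, d) running sum of key features, used as a normalizer
    """
    d = q.shape[-1]

    # Softmax branch: attention among tokens of the current frame only.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)           # (H, n, n)
    local = softmax(scores) @ v                               # (H, n, d)

    # Linear branch: read from the memory accumulated over earlier frames.
    phi_q, phi_k = feature_map(q), feature_map(k)
    num = phi_q @ state                                       # (H, n, d)
    den = np.einsum("hnd,hd->hn", phi_q, norm)[..., None] + 1e-6
    memory = num / den                                        # zeros on the first frame

    # Headwise gate: one content-dependent scalar per token and head.
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))                   # (n, H)
    g = g.T[..., None]                                        # (H, n, 1)
    out = g * local + (1.0 - g) * memory

    # Write the current frame into memory; its size is independent of video length.
    state = state + phi_k.transpose(0, 2, 1) @ v              # (H, d, d)
    norm = norm + phi_k.sum(axis=1)                           # (H, d)
    return out, state, norm

# Toy usage with made-up sizes: 5 frames of 3 tokens each.
rng = np.random.default_rng(0)
H, n, d, d_model = 2, 3, 4, 8
state, norm = np.zeros((H, d, d)), np.zeros((H, d))
w_gate = np.zeros((d_model, H))  # zero logits give a gate of exactly 0.5
for _ in range(5):
    x = rng.normal(size=(n, d_model))
    q, k, v = (rng.normal(size=(H, n, d)) for _ in range(3))
    out, state, norm = hybrid_frame_step(x, q, k, v, w_gate, state, norm)
print(state.shape)  # (2, 4, 4) regardless of how many frames were processed
```

In this sketch the memory is a fixed-size matrix per head, so the cost of processing a frame does not depend on how many frames came before it, whereas softmax attention over the full history would revisit every past token. That is the intuition behind a speedup that grows with video length. On gate size: under purely illustrative dimensions (a 1536-dimensional hidden state and 12 heads, neither stated in the claims above), a headwise gate of this shape would have 1536 × 12 = 18,432 weights, while an elementwise gate projecting to all 1536 channels would have 1536 × 1536, about 2.36 million, roughly 128 times more.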