Autoregressive Video Diffusion
Cross-source consensus on Autoregressive Video Diffusion from 1 source and 4 claims.
1 source · 4 claims
Highlighted claims
- Autoregressive video diffusion systems generate video chunk-by-chunk in a causal frame-wise manner and rely on KV caching for streaming inference. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- Softmax self-attention inside Diffusion Transformers has compute that scales as O(N²) and memory that scales as O(N) in the sequence length N. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- The KV cache for a 5-second 480p video can exceed 34 GB, and attention accounts for approximately 75% of total latency after only 14 generated chunks. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- AR video diffusion has a heterogeneous attention structure where intra-frame attention is bidirectional and inter-frame attention is causal, unlike the homogeneous causal attention in LLMs. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
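The heterogeneous attention structure in the last claim can be written as a block-causal mask: bidirectional inside a frame, causal across frames. The sketch below is a minimal illustration that contrasts this mask with the token-level causal mask used in LLMs. It assumes every frame has the same number of tokens. The function names `block_causal_mask` and `token_causal_mask` are hypothetical helpers, not from the source.

```python
def block_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """mask[i][j] is True when query token i may attend to key token j.

    Hypothetical helper: tokens in the same frame see each other (bidirectional),
    and tokens see every token of earlier frames (causal across frames).
    """
    n = num_frames * tokens_per_frame
    frame_of = [t // tokens_per_frame for t in range(n)]  # frame index of each token
    return [[frame_of[j] <= frame_of[i] for j in range(n)] for i in range(n)]


def token_causal_mask(n: int) -> list[list[bool]]:
    """Homogeneous LLM-style causal mask: token i sees only tokens j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
    print(mask[0][1])                # True: later token in the same frame is visible
    print(token_causal_mask(6)[0][1])  # False: an LLM mask hides it
    print(mask[1][2])                # False: frame 0 cannot see frame 1
```

The only difference from the LLM mask is the comparison of frame indices instead of token indices. That one change is what makes KV caching work at frame or chunk granularity rather than token granularity.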
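The first and third claims together explain why streaming generation gets expensive. Each new chunk appends its keys and values to the cache, so the cache grows linearly with the number of chunks. Each new chunk's queries also attend to everything cached so far, so the attention work per chunk grows over time as well.

The toy accounting below illustrates both effects. It is a simplification with these assumptions:
- a single sample;
- one K and one V tensor of width `hidden_dim` per layer;
- 2 bytes per element;
- repeated denoising steps per chunk are ignored.

The class and its parameters are hypothetical. The sizes are not intended to reproduce the 34 GB figure reported in the source.

```python
from dataclasses import dataclass


@dataclass
class ToyKVCache:
    """Hypothetical cost model for chunk-by-chunk generation with a KV cache."""
    tokens_per_chunk: int
    num_layers: int
    hidden_dim: int
    bytes_per_elem: int = 2  # assumption: fp16/bf16 storage
    cached_tokens: int = 0

    def append_chunk(self) -> None:
        # After a chunk is finished, its keys/values are kept for later chunks.
        self.cached_tokens += self.tokens_per_chunk

    def nbytes(self) -> int:
        # Factor 2 = keys + values, stored for every layer and every cached token.
        return 2 * self.num_layers * self.cached_tokens * self.hidden_dim * self.bytes_per_elem

    def attention_pairs_for_next_chunk(self) -> int:
        # Query tokens of the new chunk x key tokens (all cached tokens + the chunk itself).
        return self.tokens_per_chunk * (self.cached_tokens + self.tokens_per_chunk)


if __name__ == "__main__":
    cache = ToyKVCache(tokens_per_chunk=100, num_layers=2, hidden_dim=8)
    for chunk in range(3):
        print(chunk, cache.nbytes(), cache.attention_pairs_for_next_chunk())
        cache.append_chunk()
```

Under this model, the per-chunk attention work grows with the chunk index. Per-token feed-forward work per chunk stays constant. That pattern is consistent with attention coming to dominate latency after a modest number of chunks.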
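The source's title describes linear attention used as cross-frame memory. The sketch below shows generic causal linear attention with the elu(x)+1 feature map from Katharopoulos et al. (2020), swapped in purely for illustration. It is not the source's specific design.

Past keys and values are folded into a fixed-size state: a d_k × d_v matrix plus a d_k vector. The memory cost therefore stays constant no matter how many frames have been written, unlike a softmax KV cache. The class name `LinearAttentionMemory` and its methods are hypothetical.

```python
import math


def phi(x: list[float]) -> list[float]:
    # elu(x) + 1 feature map: x + 1 for x > 0, exp(x) otherwise; always positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionMemory:
    """Hypothetical fixed-size memory: S = sum phi(k) v^T, z = sum phi(k)."""

    def __init__(self, d_k: int, d_v: int):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def write(self, k: list[float], v: list[float]) -> None:
        # Fold one past key/value pair into the running sums.
        fk = phi(k)
        for a in range(len(fk)):
            self.z[a] += fk[a]
            for b in range(len(v)):
                self.S[a][b] += fk[a] * v[b]

    def read(self, q: list[float]) -> list[float]:
        # phi(q)^T S / phi(q)^T z; requires at least one prior write (else denom == 0).
        fq = phi(q)
        denom = sum(fq[a] * self.z[a] for a in range(len(fq)))
        d_v = len(self.S[0])
        return [sum(fq[a] * self.S[a][b] for a in range(len(fq))) / denom for b in range(d_v)]

    def state_size(self) -> int:
        # Number of stored floats; independent of how many pairs were written.
        return len(self.S) * len(self.S[0]) + len(self.z)


if __name__ == "__main__":
    mem = LinearAttentionMemory(d_k=2, d_v=2)
    for k, v in [([0.5, -1.0], [1.0, 0.0]), ([1.0, 0.2], [0.0, 1.0])]:
        mem.write(k, v)
    print(mem.read([0.1, -0.2]), mem.state_size())
```

Reading from the state gives the same result as attending over all stored pairs with the kernel similarity φ(q)·φ(k). The difference is cost: each read and write takes O(d_k · d_v) time instead of growing with the number of past frames.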