Autoregressive Video Diffusion

Cross-source consensus on Autoregressive Video Diffusion from 1 source and 4 claims.

How it works

Autoregressive video diffusion systems generate video chunk-by-chunk in a causal frame-wise manner and rely on KV caching for streaming inference. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
Softmax self-attention inside Diffusion Transformers incurs O(N²) compute and O(N) memory scaling with sequence length. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
The KV cache for a 5-second 480p video can exceed 34 GB, and attention accounts for approximately 75% of total latency after only 14 generated chunks. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
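The scaling behind these figures can be sketched with simple arithmetic. The following is a minimal illustration, not the paper's measurement setup: the layer count, head count, head dimension, and tokens per chunk are hypothetical values chosen only to show the trend, and the helper names `kv_cache_bytes` and `attention_score_count` are invented for this example. It shows that the cache grows linearly with the number of cached tokens while the number of query-key pairs in full softmax attention grows quadratically.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_heads * head_dim * num_tokens * bytes_per_value


def attention_score_count(num_tokens):
    """Number of query-key pairs in full softmax attention: N * N."""
    return num_tokens * num_tokens


# Hypothetical configuration, chosen only for illustration (not from the source).
LAYERS, HEADS, HEAD_DIM = 30, 12, 128
TOKENS_PER_CHUNK = 1000

for chunks in (1, 2, 4, 8):
    n = chunks * TOKENS_PER_CHUNK
    gib = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, n) / 2**30
    pairs = attention_score_count(n)
    print(f"{chunks} chunks: {n} tokens, cache {gib:.2f} GiB, {pairs:,} score pairs")
```

Doubling the number of cached chunks doubles the cache size but quadruples the attention score count, which is consistent with attention coming to dominate latency as generation proceeds.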
AR video diffusion has a heterogeneous attention structure where intra-frame attention is bidirectional and inter-frame attention is causal, unlike the homogeneous causal attention in LLMs. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
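The difference between the two attention structures can be made concrete with a small mask construction. This is a minimal sketch under the assumption that every frame contributes the same number of tokens; the function names `block_causal_mask` and `token_causal_mask` are hypothetical and do not come from the source. An entry is True when the query token (row) may attend to the key token (column).

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Bidirectional within a frame, causal across frames."""
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    # A query may see every token in its own frame and in all earlier frames.
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


def token_causal_mask(n):
    """Homogeneous causal mask as used in LLMs: each token sees itself and the past."""
    return [[k <= q for k in range(n)] for q in range(n)]


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

In the block-causal mask the first token of a frame can attend to the second token of the same frame, which a token-level causal mask forbids. That staircase of square blocks is the heterogeneous structure the claim describes.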