ARL2

Cross-source consensus on ARL2 from 1 source and 7 claims.

Highlighted claims

ARL2 decomposes self-attention into two parallel branches: softmax attention for intra-frame spatial structure and a fixed-size recurrent state for inter-frame temporal memory. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
ARL2 is the first work to convert a pretrained autoregressive video diffusion model into a hybrid linear attention architecture. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
At 1,005 frames, ARL2 completes generation using 43 GB of memory, while Causal Forcing runs out of memory at over 91 GB. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
ARL2 at 50% layer replacement achieves the highest Quality Average (87.17) among all distilled models evaluated, improving temporal flickering by +0.99 and motion smoothness by +1.35 over its teacher. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
ARL2 with 1.3B parameters matches the quality of MAGI-1, which has 4.5B parameters. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
ARL2 underperforms the teacher model on spatial relationships and human actions due to the inherent constraint of a fixed-size recurrent state. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
ARL2's improvements in temporal flickering and motion smoothness are attributed to the recurrent state providing more stable long-range context than a linearly growing KV cache. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
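
The two-branch decomposition described in the claims above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the elu(x) + 1 feature map, the normalization term, and the choice to sum the two branch outputs are all assumptions made for illustration. It shows softmax attention applied only within each frame, plus a linear-attention memory whose state keeps the same shape however many frames are processed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feature_map(x):
    # elu(x) + 1, a positive feature map often used in linear attention.
    # Assumption: the source does not say which feature map ARL2 uses.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def hybrid_attention(q, k, v):
    """Hypothetical two-branch attention over q, k, v of shape (frames, tokens, dim)."""
    frames, tokens, dim = q.shape
    state = np.zeros((dim, dim))  # fixed-size memory: running sum of phi(k)^T v
    norm = np.zeros(dim)          # running sum of phi(k), used for normalization
    out = np.empty_like(v)
    for t in range(frames):
        # Branch 1: softmax attention restricted to tokens of frame t.
        scores = q[t] @ k[t].T / np.sqrt(dim)
        local = softmax(scores) @ v[t]
        # Branch 2: read from the recurrent state built from frames before t.
        phi_q = feature_map(q[t])
        denom = phi_q @ norm + 1e-6
        memory = (phi_q @ state) / denom[:, None]
        # Assumption: the branches are combined by simple addition.
        out[t] = local + memory
        # Fold frame t into the memory for use by later frames.
        phi_k = feature_map(k[t])
        state += phi_k.T @ v[t]
        norm += phi_k.sum(axis=0)
    return out, state

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 4, 6, 8))  # 4 frames, 6 tokens per frame, dim 8
out, state = hybrid_attention(q, k, v)
print(out.shape, state.shape)
```

In this sketch the state is always a dim × dim matrix, so its memory cost does not depend on the number of frames. A KV cache over the same sequence would instead store keys and values for every past token, growing linearly with video length.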