Distillation

Cross-source consensus on Distillation from 1 source and 5 claims.


How it works

ARL2 is converted from a pretrained softmax model through a progressive two-stage distillation pipeline that trains fewer than 2% of backbone parameters. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
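
To make the "fewer than 2% of backbone parameters" figure concrete, the following is a minimal sketch of how a trainable-parameter share could be checked. The group names and parameter counts are hypothetical placeholders chosen for illustration; they are not taken from the paper, and `trainable_fraction` is an assumed helper, not part of any ARL2 code.

```python
def trainable_fraction(param_counts, trainable_names):
    """Return the share of all parameters that receive gradient updates."""
    total = sum(param_counts.values())
    trainable = sum(
        count for name, count in param_counts.items() if name in trainable_names
    )
    return trainable / total


# Hypothetical parameter groups (illustrative numbers only, not from the source).
param_counts = {
    "frozen_softmax_backbone": 985_000_000,
    "linear_attention_modules": 15_000_000,
}

# Assumption: only the newly introduced modules are trained during distillation.
fraction = trainable_fraction(param_counts, {"linear_attention_modules"})
print(f"{fraction:.2%} of parameters are trainable")

# The claim above corresponds to this share staying under 2%.
assert fraction < 0.02
```

In a real framework the same check would be done by summing element counts over parameters that have gradients enabled; the dictionary here only stands in for that bookkeeping.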
The full two-stage distillation pipeline consumes approximately 156 H100 GPU-hours. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
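
As a rough aid for reading the GPU-hour figure, the sketch below converts 156 GPU-hours into wall-clock time for a few cluster sizes. It assumes ideal linear scaling across GPUs, which real training runs rarely achieve, and the GPU counts are hypothetical; the source does not state how many GPUs were used.

```python
def wall_clock_hours(gpu_hours, num_gpus):
    """Convert a GPU-hour budget to wall-clock hours, assuming ideal scaling."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus


REPORTED_GPU_HOURS = 156  # figure quoted in the claim above

# Hypothetical cluster sizes, for illustration only.
for num_gpus in (1, 8, 64):
    hours = wall_clock_hours(REPORTED_GPU_HOURS, num_gpus)
    print(f"{num_gpus:>2} GPUs -> {hours:.2f} wall-clock hours")
```

Under these assumptions, a single 8-GPU node would finish the pipeline in 19.5 hours.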
The sensitivity-guided layer selection framework protects Hybrid-Sensitive layers from replacement to avoid irreversible imaging quality degradation, while allowing Hybrid-Recoverable gaps to close in joint distillation. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
ARL2's distillation approach requires far less compute than training from scratch; systems such as SANA-Video, by contrast, require large-scale retraining. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
The Attention Recovery Rate metric provides a principled, data-driven mechanism for layer selection without exhaustive combinatorial search. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
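
The layer-selection claims above can be illustrated with a small sketch. The source does not define how the Attention Recovery Rate is computed, so the code treats it as a given per-layer score and assumes that a higher score means a layer tolerates replacement better. The scores, the threshold, and the `select_layers` helper are all hypothetical.

```python
def select_layers(recovery_scores, threshold):
    """Split layer indices into (replace, protect) using one score per layer.

    Assumption: a higher score means the layer recovers well after its
    softmax attention is replaced, so it is safe to convert.
    """
    replace = [i for i, s in enumerate(recovery_scores) if s >= threshold]
    protect = [i for i, s in enumerate(recovery_scores) if s < threshold]
    return replace, protect


# Hypothetical per-layer scores (illustrative only, not from the paper).
scores = [0.95, 0.40, 0.88, 0.91, 0.35, 0.97]
replace, protect = select_layers(scores, threshold=0.8)
print("replace with linear attention:", replace)
print("keep as softmax (protected):  ", protect)

# Cost comparison: scoring each layer once versus trying every subset.
num_layers = len(scores)
print("per-layer evaluations:", num_layers)
print("exhaustive subsets:   ", 2 ** num_layers)
```

The last two lines show why a per-layer metric avoids combinatorial search: scoring needs one evaluation per layer, while enumerating every replace-or-keep configuration grows as 2 to the power of the layer count.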