Distillation
Cross-source consensus on Distillation from 1 source and 5 claims.
Highlighted claims
- ARL2 is converted from a pretrained softmax model through a progressive two-stage distillation pipeline that trains fewer than 2% of backbone parameters. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- The full two-stage distillation pipeline consumes approximately 156 H100 GPU-hours. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- The sensitivity-guided layer selection framework protects Hybrid-Sensitive layers from replacement to avoid irreversible imaging quality degradation, while allowing Hybrid-Recoverable gaps to close in joint distillation. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- ARL2's distillation approach requires far less compute than retraining from scratch, unlike systems such as SANA-Video that require large-scale retraining. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
- The Attention Recovery Rate metric provides a principled, data-driven mechanism for layer selection without exhaustive combinatorial search. — Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
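The layer-selection idea in the claims above can be illustrated with a minimal sketch. It assumes each layer already has a scalar recovery score and that selection is a single threshold pass; the `LayerScore` type, the `select_layers` function, the score values, and the threshold are all hypothetical and are not taken from the source, which does not define the Attention Recovery Rate here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerScore:
    index: int
    recovery_rate: float  # hypothetical score in [0, 1]; higher means easier to recover


def select_layers(scores, threshold):
    """Split layers into those kept as softmax attention and those replaced.

    Assumption: layers scoring below the threshold are treated as sensitive
    and protected; the rest are candidates for linear-attention replacement.
    """
    keep_softmax = [s.index for s in scores if s.recovery_rate < threshold]
    replace_linear = [s.index for s in scores if s.recovery_rate >= threshold]
    return keep_softmax, replace_linear


# Made-up scores for a 6-layer toy model (illustrative only).
toy_rates = [0.95, 0.40, 0.88, 0.91, 0.35, 0.97]
scores = [LayerScore(i, r) for i, r in enumerate(toy_rates)]

keep, replace = select_layers(scores, threshold=0.8)
print("keep softmax:", keep)
print("replace with linear attention:", replace)
```

Under these assumptions, selection is one pass over per-layer scores. An exhaustive search over keep/replace assignments would instead have to consider 2^6 = 64 combinations even for this toy model.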