Sparsity-Aware Scaling Law
Cross-source consensus on the Sparsity-Aware Scaling Law, drawn from 1 source and 5 claims.
Highlighted claims
- EMO uses a scaling law that explicitly includes expert count to allocate tokens across expansion stages; a sketch of one such form follows this list. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- A dense-style scaling law cannot distinguish MoE models that share the same activated size but have different expert pools. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The fitted law estimates compute-optimal cumulative token counts for each expert count at a fixed activated size; the second sketch below illustrates one way to read such counts off a fitted form. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The fitted scaling law matched the training data closely and showed low prediction error on held-out runs. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The scaling-law schedule placed expansion in a favorable quality-cost region around 45 percent of the token budget. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
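
The first two claims concern the functional form: the law must see the expert count, not just the activated parameter count, so that two models with the same activated size but different expert pools are distinguishable. Below is a minimal sketch of such a law and its least-squares fit, assuming a hypothetical Chinchilla-style form with an added expert-count exponent `gamma`; the form, parameter names, and synthetic data are illustrative assumptions, not the equation fitted in the EMO paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical sparsity-aware form: a Chinchilla-style law extended with
# an expert-count term experts**gamma. n_act is the activated parameter
# count, experts the expert count, tokens the cumulative training tokens.
# The exact functional form EMO fits is not reproduced here.
def sparsity_aware_loss(X, c, A, alpha, B, beta, gamma):
    n_act, experts, tokens = X
    return c + A / (n_act**alpha * experts**gamma) + B / tokens**beta

# Synthetic "runs" spanning several activated sizes and expert pools,
# so every parameter of the assumed form is identifiable.
rng = np.random.default_rng(0)
n = 60
n_act = rng.choice([0.5e9, 1.0e9, 2.0e9, 4.0e9], size=n)
experts = rng.choice([8, 16, 32, 64], size=n).astype(float)
tokens = rng.uniform(1e10, 3e11, size=n)
true_p = (1.7, 300.0, 0.32, 410.0, 0.28, 0.08)
losses = sparsity_aware_loss((n_act, experts, tokens), *true_p)
losses = losses + rng.normal(0.0, 0.005, size=n)  # observation noise

# Least-squares fit. Because gamma multiplies the expert count, two
# models with the same activated size but different expert pools get
# different predicted losses -- the distinction a dense-style law lacks.
p0 = [1.5, 200.0, 0.3, 300.0, 0.3, 0.1]
popt, _ = curve_fit(sparsity_aware_loss, (n_act, experts, tokens),
                    losses, p0=p0, maxfev=50000)
print(dict(zip(["c", "A", "alpha", "B", "beta", "gamma"], popt.round(3))))
```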
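
One hedged reading of the compute-optimal-token claim: with the activated size held fixed, invert the fitted form to estimate the cumulative tokens each expert count needs to reach a target loss. The target loss, parameter values, and the `tokens_to_reach` helper are hypothetical, not EMO's published allocation procedure.

```python
# Invert the hypothetical law above: at a fixed activated size, how many
# cumulative tokens does each expert count need to reach target_loss?
def tokens_to_reach(target_loss, n_act, experts, params):
    c, A, alpha, B, beta, gamma = params
    floor = c + A / (n_act**alpha * experts**gamma)  # token-independent part
    gap = target_loss - floor
    if gap <= 0:
        return float("inf")  # target lies below this pool's asymptote
    return (B / gap) ** (1.0 / beta)

params = (1.7, 300.0, 0.32, 410.0, 0.28, 0.08)  # assumed fitted values
for e in (8, 16, 32, 64):
    d = tokens_to_reach(2.35, n_act=1.0e9, experts=float(e), params=params)
    print(f"{e:>2} experts -> {d:.3e} tokens to reach loss 2.35")
```

Under this reading, a larger expert pool reaches the same loss with fewer cumulative tokens, which is the kind of quality-cost comparison that lets a schedule place expansion partway through the budget (the claims above report a favorable region near 45 percent).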