Sparsity-Aware Scaling Law
Cross-source consensus on the Sparsity-Aware Scaling Law, drawn from 1 source and 5 claims.
Highlighted claims
- EMO uses a scaling law that explicitly includes expert count to allocate tokens across expansion stages; a sketch of one such form follows this list. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- A dense-style scaling law cannot distinguish MoE models that share the same activated size but have different expert pools. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The fitted law estimates compute-optimal cumulative token counts for each expert count at a fixed activated size; the second sketch below illustrates one way to read such counts off a fitted form. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The fitted scaling law matched the training data closely and showed low prediction error on held-out runs. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The scaling-law schedule placed expansion in a favorable quality-cost region around 45 percent of the token budget. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
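
The first two claims concern the functional form: the law must see the expert count, not just the activated parameter count, so that two models with the same activated size but different expert pools are distinguishable. Below is a minimal sketch of such a law and its least-squares fit, assuming a hypothetical Chinchilla-style form with an added expert-count exponent `gamma`; the form, parameter names, and synthetic data are illustrative assumptions, not the equation fitted in the EMO paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical sparsity-aware form: a Chinchilla-style law extended with
# an expert-count term experts**gamma. n_act is the activated parameter
# count, experts the expert count, tokens the cumulative training tokens.
# The exact functional form EMO fits is not reproduced here.
def sparsity_aware_loss(X, c, A, alpha, B, beta, gamma):
    n_act, experts, tokens = X
    return c + A / (n_act**alpha * experts**gamma) + B / tokens**beta

# Synthetic "runs" spanning several activated sizes and expert pools,
# so every parameter of the assumed form is identifiable.
rng = np.random.default_rng(0)
n = 60
n_act = rng.choice([0.5e9, 1.0e9, 2.0e9, 4.0e9], size=n)
experts = rng.choice([8, 16, 32, 64], size=n).astype(float)
tokens = rng.uniform(1e10, 3e11, size=n)
true_p = (1.7, 300.0, 0.32, 410.0, 0.28, 0.08)
losses = sparsity_aware_loss((n_act, experts, tokens), *true_p)
losses = losses + rng.normal(0.0, 0.005, size=n)  # observation noise

# Least-squares fit. Because gamma multiplies the expert count, two
# models with the same activated size but different expert pools get
# different predicted losses -- the distinction a dense-style law lacks.
p0 = [1.5, 200.0, 0.3, 300.0, 0.3, 0.1]
popt, _ = curve_fit(sparsity_aware_loss, (n_act, experts, tokens),
                    losses, p0=p0, maxfev=50000)
print(dict(zip(["c", "A", "alpha", "B", "beta", "gamma"], popt.round(3))))
```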
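
One hedged reading of the compute-optimal-token claim: with the activated size held fixed, invert the fitted form to estimate the cumulative tokens each expert count needs to reach a target loss. The target loss, parameter values, and the `tokens_to_reach` helper are hypothetical, not EMO's published allocation procedure.

```python
# Invert the hypothetical law above: at a fixed activated size, how many
# cumulative tokens does each expert count need to reach target_loss?
def tokens_to_reach(target_loss, n_act, experts, params):
    c, A, alpha, B, beta, gamma = params
    floor = c + A / (n_act**alpha * experts**gamma)  # token-independent part
    gap = target_loss - floor
    if gap <= 0:
        return float("inf")  # target lies below this pool's asymptote
    return (B / gap) ** (1.0 / beta)

params = (1.7, 300.0, 0.32, 410.0, 0.28, 0.08)  # assumed fitted values
for e in (8, 16, 32, 64):
    d = tokens_to_reach(2.35, n_act=1.0e9, experts=float(e), params=params)
    print(f"{e:>2} experts -> {d:.3e} tokens to reach loss 2.35")
```

Under this reading, a larger expert pool reaches the same loss with fewer cumulative tokens, which is the kind of quality-cost comparison that lets a schedule place expansion partway through the budget (the claims above report a favorable region near 45 percent).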