MoE Efficiency Paradox
Cross-source consensus on the MoE Efficiency Paradox, drawn from 1 source and 4 claims.
Highlighted claims
- Growing the expert pool from 8 to 128 experts multiplied wall-clock step time by 1.08x for the 1.1B activated-parameter model and by 1.72x for the 4B activated-parameter model. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The MoE efficiency paradox is the mismatch between nearly constant theoretical FLOPs and worsening real training efficiency as expert count grows. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Wall-clock step time rises with expert count because communication, storage, routing overhead, and small expert computations scale with the full expert pool. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- MoE efficiency should be evaluated using wall-clock costs tied to total expert count, not only activated FLOPs. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
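To make the gap between FLOPs and wall-clock cost concrete, the short sketch below converts the step-time ratios quoted in the claims above into relative training throughput. It is an illustration only, not the source's method. It assumes activated FLOPs per step are exactly unchanged between 8 and 128 experts (the claims say only "nearly constant"), and the names `STEP_TIME_RATIO`, `FLOPS_RATIO`, `relative_throughput`, and `summarize` are hypothetical.

```python
# Step-time ratios (128 experts vs. 8 experts) taken from the highlighted claims.
STEP_TIME_RATIO = {
    "1.1B activated": 1.08,
    "4B activated": 1.72,
}

# Assumption: activated FLOPs per step are treated as exactly unchanged (ratio 1.0).
# The claims describe them only as "nearly constant".
FLOPS_RATIO = 1.0


def relative_throughput(flops_ratio: float, step_time_ratio: float) -> float:
    """Useful FLOPs per second at 128 experts, relative to the 8-expert baseline."""
    return flops_ratio / step_time_ratio


def summarize() -> dict[str, float]:
    """Relative throughput for each model size, rounded to three decimals."""
    return {
        name: round(relative_throughput(FLOPS_RATIO, ratio), 3)
        for name, ratio in STEP_TIME_RATIO.items()
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}x baseline throughput")
```

Under that assumption, a FLOPs-only comparison would rate the 8-expert and 128-expert configurations as equally expensive, while the wall-clock view shows the larger model losing a substantial share of its effective throughput.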