Mixture-of-Experts Transformers
Cross-source consensus on Mixture-of-Experts Transformers, drawn from 1 source and 4 claims.
Highlighted claims
- Sparse MoE transformers replace dense feed-forward layers with multiple experts while routing each token to only a small subset. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- With top-k routing, per-token FLOPs depend mainly on activated experts and activated parameters rather than the full expert pool. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- MoE architectures can increase total parameters without proportionally increasing theoretical computation. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The MoE layer combines the outputs of the top-k selected experts using normalized router weights, as sketched in the example after this list. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
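The claims above describe a standard sparse MoE feed-forward layer: a router scores each token, only the top-k experts run on that token, and their outputs are mixed with router weights normalized over the selected experts. Below is a minimal PyTorch sketch of that pattern, not the EMO paper's implementation; the class name `SparseMoELayer` and the hyperparameters (`d_model`, `d_ff`, `num_experts`, `top_k`) are illustrative assumptions.

```python
# Minimal sketch of a sparse MoE feed-forward layer with top-k routing.
# Assumes PyTorch; names and sizes are illustrative, not from the EMO paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # Router scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an ordinary feed-forward block (replaces the dense FFN).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                           # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)
        # Router weights are normalized over the selected experts only.
        weights = F.softmax(topk_logits, dim=-1)          # (num_tokens, top_k)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                       # chosen expert per token
            w = weights[:, slot].unsqueeze(-1)            # its normalized weight
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only tokens routed to expert e pay its FLOPs.
                    out[mask] += w[mask] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = SparseMoELayer(d_model=64, d_ff=256, num_experts=8, top_k=2)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])

    # Total expert parameters grow with num_experts, but each token only
    # activates top_k of them, so per-token compute stays roughly constant.
    total = sum(p.numel() for p in layer.experts.parameters())
    active = total // len(layer.experts) * layer.top_k
    print(f"total expert params: {total:,}, active per token: {active:,}")
```

In this sketch, doubling `num_experts` doubles the total expert parameters while leaving the per-token activated parameters and FLOPs unchanged, which is the property the second and third claims describe.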