Mixture-of-Experts Transformers
Cross-source consensus on Mixture-of-Experts Transformers, drawn from 1 source and 4 claims.
Highlighted claims
- Sparse MoE transformers replace dense feed-forward layers with multiple experts while routing each token to only a small subset. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- With top-k routing, per-token FLOPs depend mainly on activated experts and activated parameters rather than the full expert pool. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- MoE architectures can increase total parameters without proportionally increasing theoretical computation. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The MoE layer combines the outputs of the top-k selected experts using normalized router weights, as sketched in the example after this list. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
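The claims above describe a standard sparse MoE feed-forward layer: a router scores each token, only the top-k experts run on that token, and their outputs are mixed with router weights normalized over the selected experts. Below is a minimal PyTorch sketch of that pattern, not the EMO paper's implementation; the class name `SparseMoELayer` and the hyperparameters (`d_model`, `d_ff`, `num_experts`, `top_k`) are illustrative assumptions.

```python
# Minimal sketch of a sparse MoE feed-forward layer with top-k routing.
# Assumes PyTorch; names and sizes are illustrative, not from the EMO paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # Router scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an ordinary feed-forward block (replaces the dense FFN).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                           # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)
        # Router weights are normalized over the selected experts only.
        weights = F.softmax(topk_logits, dim=-1)          # (num_tokens, top_k)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                       # chosen expert per token
            w = weights[:, slot].unsqueeze(-1)            # its normalized weight
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only tokens routed to expert e pay its FLOPs.
                    out[mask] += w[mask] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = SparseMoELayer(d_model=64, d_ff=256, num_experts=8, top_k=2)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])

    # Total expert parameters grow with num_experts, but each token only
    # activates top_k of them, so per-token compute stays roughly constant.
    total = sum(p.numel() for p in layer.experts.parameters())
    active = total // len(layer.experts) * layer.top_k
    print(f"total expert params: {total:,}, active per token: {active:,}")
```

In this sketch, doubling `num_experts` doubles the total expert parameters while leaving the per-token activated parameters and FLOPs unchanged, which is the property the second and third claims describe.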