Pretraining Performance
Cross-source consensus on Pretraining Performance from 1 source and 5 claims.
Highlighted claims
- In the main 1.92T-token experiment, EMO reached a final pretraining loss of 1.017. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- EMO remained behind the fixed E=128 baseline by 0.023 in absolute loss, a 2.3 percent relative gap (see the arithmetic sketch after this list). — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- EMO reduced GPU hours by 10 percent while approaching the quality of the fixed E=128 baseline. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Expansion-induced loss spikes recovered within roughly 10,000 steps. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- The final EMO model did not fully match the fixed E=128 baseline across all pretraining and downstream results. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
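
As a quick check on the gap claim above, the sketch below reproduces the relationship between the absolute and relative loss figures. The fixed E=128 baseline loss is not quoted in this section; it is assumed here to equal EMO's final loss minus the stated absolute gap.

```python
# Minimal sketch of the absolute vs. relative loss-gap arithmetic.
# Assumption: the fixed E=128 baseline loss is derived from EMO's final
# loss and the stated absolute gap, since it is not quoted directly here.

emo_final_loss = 1.017   # EMO final pretraining loss (from the claims above)
absolute_gap = 0.023     # EMO loss minus baseline loss (stated absolute gap)

baseline_loss = emo_final_loss - absolute_gap   # implied fixed E=128 baseline loss
relative_gap = absolute_gap / baseline_loss     # gap relative to the baseline

print(f"Implied fixed E=128 baseline loss: {baseline_loss:.3f}")  # ~0.994
print(f"Relative loss gap: {relative_gap:.1%}")                   # ~2.3%, matching the claim
```

Under this assumption the implied baseline loss is about 0.994, and the 0.023 absolute gap works out to roughly 2.3 percent relative, consistent with the highlighted claim.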