Expansion Initialization
Cross-source consensus on Expansion Initialization, drawn from one source and five claims.
Highlighted claims
- At expansion boundaries, old experts are retained while new experts and new router rows are Gaussian-initialized. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Optimizer states are reset at expansion to avoid Adam moment-buffer dimension mismatches when new expert rows are added. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Initialization choices changed transient expansion spikes more than final loss. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Carrying optimizer state across expansion gave only marginal benefit that disappeared after about 500 warmup steps. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Copying from old checkpoints produced the largest initial spike but stabilized fastest. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
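The retain-old / Gaussian-init-new recipe in the claims above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function names, the expert parameter layout, and the `sigma=0.02` scale are assumptions. Old router rows and expert weights are kept verbatim; only the rows for newly added experts are drawn from a Gaussian, and optimizer state is simply discarded at the boundary rather than reshaped.

```python
import numpy as np

def expand_router(router_w, num_new, sigma=0.02, seed=0):
    """Grow the router from E to E + num_new experts.

    Old rows are copied unchanged; the new rows are Gaussian-initialized.
    (Sketch: `sigma` and the row-per-expert layout are assumptions.)
    """
    rng = np.random.default_rng(seed)
    d_model = router_w.shape[1]
    new_rows = rng.normal(0.0, sigma, size=(num_new, d_model))
    return np.concatenate([router_w, new_rows], axis=0)

def expand_experts(experts, num_new, d_model, d_ff, sigma=0.02, seed=0):
    """Append num_new Gaussian-initialized FFN experts.

    Existing experts are retained as-is; only the appended ones are fresh.
    """
    rng = np.random.default_rng(seed)
    new = [{"w_in": rng.normal(0.0, sigma, (d_ff, d_model)),
            "w_out": rng.normal(0.0, sigma, (d_model, d_ff))}
           for _ in range(num_new)]
    return experts + new

def expand_checkpoint(router_w, experts, optim_state, num_new, d_model, d_ff):
    """Apply one expansion step.

    Per the claims above, optimizer state (e.g. Adam moment buffers) is
    reset to empty at the boundary, since its tensors no longer match the
    enlarged parameter shapes.
    """
    return (expand_router(router_w, num_new),
            expand_experts(experts, num_new, d_model, d_ff),
            {})  # fresh optimizer state
```

Resetting the optimizer state sidesteps the Adam moment-buffer dimension mismatch entirely; consistent with the claim above, any benefit from instead carrying state across the boundary would be washed out after a short warmup anyway.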