Expansion Initialization
Cross-source consensus on Expansion Initialization, drawn from one source and five claims.
Highlighted claims
- At expansion boundaries, old experts are retained while new experts and new router rows are Gaussian-initialized. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Optimizer states are reset at expansion to avoid Adam moment-buffer dimension mismatches when new expert rows are added. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Initialization choices changed transient expansion spikes more than final loss. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Carrying optimizer state across expansion gave only marginal benefit that disappeared after about 500 warmup steps. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
- Copying from old checkpoints produced the largest initial spike but stabilized fastest. — EMO: Frustratingly Easy Progressive Training of Extendable MoE
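The retain-old / Gaussian-init-new recipe in the claims above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function names, the expert parameter layout, and the `sigma=0.02` scale are assumptions. Old router rows and expert weights are kept verbatim; only the rows for newly added experts are drawn from a Gaussian, and optimizer state is simply discarded at the boundary rather than reshaped.

```python
import numpy as np

def expand_router(router_w, num_new, sigma=0.02, seed=0):
    """Grow the router from E to E + num_new experts.

    Old rows are copied unchanged; the new rows are Gaussian-initialized.
    (Sketch: `sigma` and the row-per-expert layout are assumptions.)
    """
    rng = np.random.default_rng(seed)
    d_model = router_w.shape[1]
    new_rows = rng.normal(0.0, sigma, size=(num_new, d_model))
    return np.concatenate([router_w, new_rows], axis=0)

def expand_experts(experts, num_new, d_model, d_ff, sigma=0.02, seed=0):
    """Append num_new Gaussian-initialized FFN experts.

    Existing experts are retained as-is; only the appended ones are fresh.
    """
    rng = np.random.default_rng(seed)
    new = [{"w_in": rng.normal(0.0, sigma, (d_ff, d_model)),
            "w_out": rng.normal(0.0, sigma, (d_model, d_ff))}
           for _ in range(num_new)]
    return experts + new

def expand_checkpoint(router_w, experts, optim_state, num_new, d_model, d_ff):
    """Apply one expansion step.

    Per the claims above, optimizer state (e.g. Adam moment buffers) is
    reset to empty at the boundary, since its tensors no longer match the
    enlarged parameter shapes.
    """
    return (expand_router(router_w, num_new),
            expand_experts(experts, num_new, d_model, d_ff),
            {})  # fresh optimizer state
```

Resetting the optimizer state sidesteps the Adam moment-buffer dimension mismatch entirely; consistent with the claim above, any benefit from instead carrying state across the boundary would be washed out after a short warmup anyway.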