Matrix-Based Optimizers
Cross-source consensus on Matrix-Based Optimizers from 1 source and 6 claims.
How it works
- Matrix-based optimizers treat Transformer weights as matrices rather than as independent scalar coordinates. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Muon forms a momentum matrix and applies approximate orthogonalization before updating weights. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Matrix-based updates transform an update matrix using operations such as normalization, whitening, or orthogonalization; a sketch of the orthogonalization case follows this list. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
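The following is a minimal PyTorch-style sketch of the Muon-style update described above, not the source's reference implementation. The function names (`newton_schulz_orthogonalize`, `muon_step`) are hypothetical, and the quintic iteration coefficients follow public Muon implementations; both should be read as assumptions here.

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix via Newton-Schulz iteration.

    Drives the singular values of M toward 1 without an explicit SVD.
    Coefficients are those used in public Muon implementations (assumed).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)          # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                       # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(weight: torch.Tensor, grad: torch.Tensor,
              momentum_buf: torch.Tensor, lr: float = 0.02,
              beta: float = 0.95) -> None:
    """One hypothetical Muon-style step on a single 2D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)                  # momentum matrix
    update = newton_schulz_orthogonalize(momentum_buf)  # matrix-level transform
    weight.add_(update, alpha=-lr)

# usage (shapes assumed 2D):
W = torch.randn(512, 512)
g = torch.randn(512, 512)
buf = torch.zeros_like(W)
muon_step(W, g, buf)
```

The key contrast with an elementwise optimizer is that the transform acts on the whole momentum matrix at once, rather than rescaling each coordinate independently.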
Risks & contraindications
- Matrix-based optimizers usually require hybrid parameter grouping, because biases, normalization parameters, scalar parameters, and small tensors are less natural candidates for matrix-level updates; see the grouping sketch below. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
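A minimal sketch of that hybrid grouping, assuming a PyTorch `nn.Module`; the `ndim == 2` heuristic and the fallback to `torch.optim.AdamW` are illustrative assumptions, not the source's prescription.

```python
import torch
from torch import nn

def split_param_groups(model: nn.Module):
    """Route 2D weight matrices to a matrix-based optimizer and everything
    else (biases, norm scales, scalars, small tensors) to an elementwise one.
    The ndim == 2 heuristic is an illustrative assumption."""
    matrix_params, elementwise_params = [], []
    for p in model.parameters():
        (matrix_params if p.ndim == 2 else elementwise_params).append(p)
    return matrix_params, elementwise_params

# the elementwise group can fall back to a standard optimizer, e.g.:
# fallback = torch.optim.AdamW(elementwise_params, lr=3e-4)
```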
Comparisons
- Muon differs from AdamW, Lion, GaLore, and Shampoo by focusing on matrix-level update geometry rather than coordinate scaling, sign normalization, low-rank memory compression, or curvature preconditioning; the schematic below contrasts these update rules. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
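A schematic contrast of the five update rules, each reduced to its defining operation; weight decay, bias correction, and scaling constants are omitted, so read these as simplified assumptions rather than exact formulas.

```python
# Schematic per-step update rules (m, v: optimizer state; g: gradient;
# w: weights, treated as a matrix where relevant):
#
# AdamW   (coordinate scaling):      w -= lr * m / (sqrt(v) + eps)
# Lion    (sign normalization):      w -= lr * sign(beta1 * m + (1 - beta1) * g)
# GaLore  (low-rank memory):         optimizer state kept on P.T @ g,
#                                    a rank-r projection of the gradient
# Shampoo (curvature precondition):  w -= lr * L**(-1/4) @ g @ R**(-1/4)
# Muon    (matrix update geometry):  w -= lr * orthogonalize(momentum_matrix)
```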
Evidence quality
- The theoretical basis for matrix-based optimizers remains incomplete, including open questions about the geometry of orthogonalized updates and about shard-level orthogonalization. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers