Matrix-Based Optimizers
Cross-source consensus on Matrix-Based Optimizers from 1 source and 6 claims.
How it works
- Matrix-based optimizers treat Transformer weights as matrices rather than as independent scalar coordinates. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Muon forms a momentum matrix and applies approximate orthogonalization before updating weights. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Matrix-based updates transform an update matrix using operations such as normalization, whitening, or orthogonalization; a sketch of the orthogonalization case follows this list. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
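The following is a minimal PyTorch-style sketch of the Muon-style update described above, not the source's reference implementation. The function names (`newton_schulz_orthogonalize`, `muon_step`) are hypothetical, and the quintic iteration coefficients follow public Muon implementations; both should be read as assumptions here.

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix via Newton-Schulz iteration.

    Drives the singular values of M toward 1 without an explicit SVD.
    Coefficients are those used in public Muon implementations (assumed).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)          # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                       # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(weight: torch.Tensor, grad: torch.Tensor,
              momentum_buf: torch.Tensor, lr: float = 0.02,
              beta: float = 0.95) -> None:
    """One hypothetical Muon-style step on a single 2D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)                  # momentum matrix
    update = newton_schulz_orthogonalize(momentum_buf)  # matrix-level transform
    weight.add_(update, alpha=-lr)

# usage (shapes assumed 2D):
W = torch.randn(512, 512)
g = torch.randn(512, 512)
buf = torch.zeros_like(W)
muon_step(W, g, buf)
```

The key contrast with an elementwise optimizer is that the transform acts on the whole momentum matrix at once, rather than rescaling each coordinate independently.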
Risks & contraindications
- Matrix-based optimizers usually require hybrid parameter grouping, because biases, normalization parameters, scalar parameters, and small tensors are less natural candidates for matrix-level updates; see the grouping sketch below. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
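A minimal sketch of that hybrid grouping, assuming a PyTorch `nn.Module`; the `ndim == 2` heuristic and the fallback to `torch.optim.AdamW` are illustrative assumptions, not the source's prescription.

```python
import torch
from torch import nn

def split_param_groups(model: nn.Module):
    """Route 2D weight matrices to a matrix-based optimizer and everything
    else (biases, norm scales, scalars, small tensors) to an elementwise one.
    The ndim == 2 heuristic is an illustrative assumption."""
    matrix_params, elementwise_params = [], []
    for p in model.parameters():
        (matrix_params if p.ndim == 2 else elementwise_params).append(p)
    return matrix_params, elementwise_params

# the elementwise group can fall back to a standard optimizer, e.g.:
# fallback = torch.optim.AdamW(elementwise_params, lr=3e-4)
```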
Comparisons
- Muon differs from AdamW, Lion, GaLore, and Shampoo by focusing on matrix-level update geometry rather than coordinate scaling, sign normalization, low-rank memory compression, or curvature preconditioning; the schematic below contrasts these update rules. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
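A schematic contrast of the five update rules, each reduced to its defining operation; weight decay, bias correction, and scaling constants are omitted, so read these as simplified assumptions rather than exact formulas.

```python
# Schematic per-step update rules (m, v: optimizer state; g: gradient;
# w: weights, treated as a matrix where relevant):
#
# AdamW   (coordinate scaling):      w -= lr * m / (sqrt(v) + eps)
# Lion    (sign normalization):      w -= lr * sign(beta1 * m + (1 - beta1) * g)
# GaLore  (low-rank memory):         optimizer state kept on P.T @ g,
#                                    a rank-r projection of the gradient
# Shampoo (curvature precondition):  w -= lr * L**(-1/4) @ g @ R**(-1/4)
# Muon    (matrix update geometry):  w -= lr * orthogonalize(momentum_matrix)
```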
Evidence quality
- The theoretical basis for matrix-based optimizers remains incomplete, including open questions about the geometry of orthogonalized updates and about shard-level orthogonalization. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers