Memory-Efficient Optimizers
Cross-source consensus on Memory-Efficient Optimizers, drawn from 1 source and 5 claims.
How it works
Memory-efficient optimizers cut optimizer-state costs through factorization, quantization, grouping, projection, or fused updates. Adafactor, for example, replaces the full second-moment matrix of a matrix parameter with a row accumulator and a column accumulator.

Benefits
Reducing optimizer state can change which training regimes are feasible under a given hardware budget: the freed memory can go toward larger models, longer contexts, larger microbatches, or full-parameter fine-tuning.

Risks & contraindications
The measured memory advantage can shrink under sharding, tensor parallelism, and custom kernels, and these methods tend to be more valuable for enabling new regimes than for lowering loss on an otherwise unchanged model.
Highlighted claims
- Adafactor reduces second-moment memory for matrix parameters by replacing the full matrix with row and column accumulators (see the first sketch after this list). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Memory-efficient optimizers may be more valuable for enabling larger models, longer contexts, larger microbatches, or full-parameter fine-tuning than for lowering loss on an unchanged model. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Memory-efficient optimizers reduce optimizer-state costs using factorization, quantization, grouping, projection, or fused updates (the quantization pattern is sketched in the second example below). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Reducing optimizer state can change which training regimes are feasible under a given hardware budget (see the back-of-envelope arithmetic below). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- The measured memory advantage of memory-efficient methods can change under sharding, tensor parallelism, and custom kernels. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
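
To make the row/column factorization concrete, here is a minimal NumPy sketch of an Adafactor-style factored second-moment update for a single matrix parameter. It illustrates the factored estimate described in the claim above, not the reference implementation: the function and variable names are ours, and bias correction, the decaying beta2 schedule, update clipping, and relative step sizes from the full algorithm are omitted.

```python
import numpy as np

def adafactor_second_moment_step(grad, row_acc, col_acc, beta2=0.999, eps=1e-30):
    """One Adafactor-style factored second-moment update for a matrix parameter.

    Instead of a full (n, m) second-moment matrix, the optimizer state is:
      row_acc: shape (n,), EMA of row sums of grad**2
      col_acc: shape (m,), EMA of column sums of grad**2
    so second-moment memory per matrix drops from n*m to n + m values.
    """
    sq = grad * grad + eps
    row_acc = beta2 * row_acc + (1.0 - beta2) * sq.sum(axis=1)
    col_acc = beta2 * col_acc + (1.0 - beta2) * sq.sum(axis=0)
    # Rank-1 reconstruction: V_hat ~= outer(row_acc, col_acc) / sum(row_acc)
    v_hat = np.outer(row_acc, col_acc) / row_acc.sum()
    return grad / np.sqrt(v_hat), row_acc, col_acc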
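
For a 1024 x 4096 weight matrix, this keeps 5,120 floats of second-moment state instead of roughly 4.2 million.

The technique list above also mentions quantization. A common pattern in 8-bit optimizers is to store a moment as int8 plus a scale, dequantizing only for the update. The sketch below uses a single per-tensor absmax scale for clarity; real implementations typically use block-wise scales and nonlinear quantization maps, and all names here are illustrative.

```python
import numpy as np

def quantized_momentum_step(grad, m_q, scale, beta1=0.9):
    """Sketch of a quantized first-moment (momentum) update: the state lives
    as int8 plus one float scale, costing ~1 byte per parameter instead of 4.
    """
    m = m_q.astype(np.float32) * scale           # dequantize to fp32
    m = beta1 * m + (1.0 - beta1) * grad         # standard EMA update
    new_scale = np.abs(m).max() / 127.0 + 1e-12  # per-tensor absmax scale
    m_q = np.clip(np.round(m / new_scale), -127, 127).astype(np.int8)
    return m, m_q, new_scale                     # m feeds the weight update
```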
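
The feasibility claim becomes concrete with simple state-size arithmetic. In this hedged sketch, the 8-bytes-per-parameter figure is the standard cost of AdamW's two fp32 moments, the 7B parameter count is illustrative, and the factored-state figure is an order-of-magnitude estimate for large matrix parameters:

```python
def optimizer_state_gb(n_params: float, state_bytes_per_param: float) -> float:
    """Back-of-envelope optimizer-state footprint in gigabytes."""
    return n_params * state_bytes_per_param / 1e9

N = 7e9  # illustrative 7B-parameter model
print(optimizer_state_gb(N, 8.0))   # AdamW, two fp32 moments:        ~56 GB
print(optimizer_state_gb(N, 2.0))   # both moments quantized to int8: ~14 GB
print(optimizer_state_gb(N, 0.01))  # factored second moment only:    ~0.07 GB
```

Under a fixed GPU budget, the roughly 42 GB difference between the first two lines is memory that can instead hold a larger model, activations for longer contexts or larger microbatches, or full-parameter fine-tuning state.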