Memory-Efficient Optimizers
Cross-source consensus on Memory-Efficient Optimizers, drawn from 1 source and 5 claims.
How it works
Memory-efficient optimizers cut optimizer-state costs through factorization, quantization, grouping, projection, or fused updates. Adafactor, for example, replaces the full second-moment matrix of a matrix parameter with a row accumulator and a column accumulator.

Benefits
Reducing optimizer state can change which training regimes are feasible under a given hardware budget: the freed memory can go toward larger models, longer contexts, larger microbatches, or full-parameter fine-tuning.

Risks & contraindications
The measured memory advantage can shrink under sharding, tensor parallelism, and custom kernels, and these methods tend to be more valuable for enabling new regimes than for lowering loss on an otherwise unchanged model.
Highlighted claims
- Adafactor reduces second-moment memory for matrix parameters by replacing the full matrix with row and column accumulators (see the first sketch after this list). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Memory-efficient optimizers may be more valuable for enabling larger models, longer contexts, larger microbatches, or full-parameter fine-tuning than for lowering loss on an unchanged model. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Memory-efficient optimizers reduce optimizer-state costs using factorization, quantization, grouping, projection, or fused updates (the quantization pattern is sketched in the second example below). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Reducing optimizer state can change which training regimes are feasible under a given hardware budget (see the back-of-envelope arithmetic below). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- The measured memory advantage of memory-efficient methods can change under sharding, tensor parallelism, and custom kernels. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
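
To make the row/column factorization concrete, here is a minimal NumPy sketch of an Adafactor-style factored second-moment update for a single matrix parameter. It illustrates the factored estimate described in the claim above, not the reference implementation: the function and variable names are ours, and bias correction, the decaying beta2 schedule, update clipping, and relative step sizes from the full algorithm are omitted.

```python
import numpy as np

def adafactor_second_moment_step(grad, row_acc, col_acc, beta2=0.999, eps=1e-30):
    """One Adafactor-style factored second-moment update for a matrix parameter.

    Instead of a full (n, m) second-moment matrix, the optimizer state is:
      row_acc: shape (n,), EMA of row sums of grad**2
      col_acc: shape (m,), EMA of column sums of grad**2
    so second-moment memory per matrix drops from n*m to n + m values.
    """
    sq = grad * grad + eps
    row_acc = beta2 * row_acc + (1.0 - beta2) * sq.sum(axis=1)
    col_acc = beta2 * col_acc + (1.0 - beta2) * sq.sum(axis=0)
    # Rank-1 reconstruction: V_hat ~= outer(row_acc, col_acc) / sum(row_acc)
    v_hat = np.outer(row_acc, col_acc) / row_acc.sum()
    return grad / np.sqrt(v_hat), row_acc, col_acc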
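
For a 1024 x 4096 weight matrix, this keeps 5,120 floats of second-moment state instead of roughly 4.2 million.

The technique list above also mentions quantization. A common pattern in 8-bit optimizers is to store a moment as int8 plus a scale, dequantizing only for the update. The sketch below uses a single per-tensor absmax scale for clarity; real implementations typically use block-wise scales and nonlinear quantization maps, and all names here are illustrative.

```python
import numpy as np

def quantized_momentum_step(grad, m_q, scale, beta1=0.9):
    """Sketch of a quantized first-moment (momentum) update: the state lives
    as int8 plus one float scale, costing ~1 byte per parameter instead of 4.
    """
    m = m_q.astype(np.float32) * scale           # dequantize to fp32
    m = beta1 * m + (1.0 - beta1) * grad         # standard EMA update
    new_scale = np.abs(m).max() / 127.0 + 1e-12  # per-tensor absmax scale
    m_q = np.clip(np.round(m / new_scale), -127, 127).astype(np.int8)
    return m, m_q, new_scale                     # m feeds the weight update
```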
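
The feasibility claim becomes concrete with simple state-size arithmetic. In this hedged sketch, the 8-bytes-per-parameter figure is the standard cost of AdamW's two fp32 moments, the 7B parameter count is illustrative, and the factored-state figure is an order-of-magnitude estimate for large matrix parameters:

```python
def optimizer_state_gb(n_params: float, state_bytes_per_param: float) -> float:
    """Back-of-envelope optimizer-state footprint in gigabytes."""
    return n_params * state_bytes_per_param / 1e9

N = 7e9  # illustrative 7B-parameter model
print(optimizer_state_gb(N, 8.0))   # AdamW, two fp32 moments:        ~56 GB
print(optimizer_state_gb(N, 2.0))   # both moments quantized to int8: ~14 GB
print(optimizer_state_gb(N, 0.01))  # factored second moment only:    ~0.07 GB
```

Under a fixed GPU budget, the roughly 42 GB difference between the first two lines is memory that can instead hold a larger model, activations for longer contexts or larger microbatches, or full-parameter fine-tuning state.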