AdamW
Cross-source consensus on AdamW, drawn from 1 source and 5 claims.
Highlighted claims
- AdamW's main limitation is the memory required to store optimizer state for every trainable parameter. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- AdamW remains the dominant baseline for contemporary LLM pretraining and fine-tuning because it combines adaptive moments with decoupled weight decay. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- AdamW performance depends on tuning choices and implementation details rather than being a single fixed baseline. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- AdamW avoids coupling regularization with adaptive preconditioning by applying weight decay separately from the adaptive denominator (see the sketch after this list). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- AdamW remains hard to beat when it is strongly tuned and supported by mature implementations. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
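To make the decoupled-decay claim concrete, below is a minimal illustrative sketch of a single AdamW update in plain NumPy. It is not the cited source's implementation; the hyperparameter names and defaults (lr, beta1, beta2, eps, weight_decay) are assumptions chosen to match common library conventions.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Decoupled weight decay: shrink the weights directly, outside the
    # adaptive update, so the regularization strength is not rescaled by
    # the adaptive denominator sqrt(v_hat) + eps.
    param = param * (1 - lr * weight_decay)
    # Adaptive moments: exponential moving averages of grad and grad**2.
    # These two buffers are the per-parameter optimizer state.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment buffers.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step; weight decay is deliberately absent from this term.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Tiny usage example (hypothetical values) on a 3-element parameter vector.
p = np.array([0.5, -0.3, 0.1])
g = np.array([0.01, -0.02, 0.03])
m = np.zeros_like(p)
v = np.zeros_like(p)
p, m, v = adamw_step(p, g, m, v, t=1)
```

The sketch also shows where the memory cost in the first claim comes from: m and v each match the parameters in shape and dtype, so fp32 AdamW carries roughly 8 extra bytes of optimizer state per trainable parameter, on the order of 56 GB for a 7B-parameter model.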