AdamW
Cross-source consensus on AdamW, drawn from 1 source and 5 claims.
Highlighted claims
- AdamW's main limitation is the memory required to store optimizer state for every trainable parameter. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- AdamW remains the dominant baseline for contemporary LLM pretraining and fine-tuning because it combines adaptive moments with decoupled weight decay. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- AdamW performance depends on tuning choices and implementation details rather than being a single fixed baseline. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- AdamW avoids coupling regularization with adaptive preconditioning by applying weight decay separately from the adaptive denominator (see the sketch after this list). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- AdamW remains hard to beat when it is strongly tuned and supported by mature implementations. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
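To make the decoupled-decay claim concrete, below is a minimal illustrative sketch of a single AdamW update in plain NumPy. It is not the cited source's implementation; the hyperparameter names and defaults (lr, beta1, beta2, eps, weight_decay) are assumptions chosen to match common library conventions.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Decoupled weight decay: shrink the weights directly, outside the
    # adaptive update, so the regularization strength is not rescaled by
    # the adaptive denominator sqrt(v_hat) + eps.
    param = param * (1 - lr * weight_decay)
    # Adaptive moments: exponential moving averages of grad and grad**2.
    # These two buffers are the per-parameter optimizer state.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment buffers.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step; weight decay is deliberately absent from this term.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Tiny usage example (hypothetical values) on a 3-element parameter vector.
p = np.array([0.5, -0.3, 0.1])
g = np.array([0.01, -0.02, 0.03])
m = np.zeros_like(p)
v = np.zeros_like(p)
p, m, v = adamw_step(p, g, m, v, t=1)
```

The sketch also shows where the memory cost in the first claim comes from: m and v each match the parameters in shape and dtype, so fp32 AdamW carries roughly 8 extra bytes of optimizer state per trainable parameter, on the order of 56 GB for a 7B-parameter model.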