Curvature-Aware Optimizers
Cross-source consensus on Curvature-Aware Optimizers from 1 source and 5 claims.
Highlighted claims
- Curvature methods must be evaluated on both token efficiency and wall-clock efficiency because their steps can be more expensive. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Curvature-aware optimizers attempt to improve update geometry by approximating second-order information with practical structure. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Shampoo uses matrix statistics and inverse matrix roots to provide richer preconditioning than diagonal adaptivity (see the first sketch after this list). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Curvature-aware optimizers add computation and hyperparameters through inverse roots, Hessian estimates, damping, clipping, and schedules. — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
- Sophia uses lightweight diagonal Hessian-like curvature estimates for language-model pretraining (see the second sketch after this list). — Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
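To make the Shampoo claim concrete, here is a minimal single-matrix sketch, assuming a 2-D parameter, exponential moving averages for the left and right gradient statistics, and eigendecomposition-based inverse fourth roots. Function and variable names are illustrative; this is not the reference implementation, which adds blocking, grafting, and staleness controls.

```python
import numpy as np

def shampoo_step(G, L, R, lr=1e-3, beta=0.99, eps=1e-6):
    """One simplified Shampoo-style step for a 2-D parameter gradient G.

    L and R accumulate row- and column-space gradient statistics; the
    gradient is preconditioned by their inverse fourth roots, which is
    strictly richer than the diagonal scaling used by AdamW.
    """
    L = beta * L + (1 - beta) * (G @ G.T)  # left (row-space) statistics
    R = beta * R + (1 - beta) * (G.T @ G)  # right (column-space) statistics

    def inv_root(M, p):
        # Inverse p-th matrix root via eigendecomposition, damped by eps.
        w, V = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return (V * w ** (-1.0 / p)) @ V.T

    update = inv_root(L, 4) @ G @ inv_root(R, 4)
    return -lr * update, L, R
```

For a parameter of shape (m, n), L is (m, m) and R is (n, n), and each step pays for two eigendecompositions: this is the extra per-step cost and memory that the token-efficiency vs. wall-clock claim above refers to.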
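Likewise, a minimal sketch of a Sophia-style update, assuming a gradient EMA m and a diagonal curvature EMA h that is refreshed elsewhere (e.g., from a Hessian estimator every few steps). The names gamma and the clipping range follow common presentations of the method and are assumptions here, not the paper's exact implementation.

```python
import numpy as np

def sophia_step(theta, grad, m, h, lr=1e-4, beta1=0.96, gamma=0.01, eps=1e-12):
    """One simplified Sophia-style step with a diagonal curvature estimate h.

    m is an EMA of gradients; h is an EMA of a diagonal Hessian-like
    estimate, assumed to be refreshed periodically by a separate estimator.
    Element-wise clipping bounds the step where curvature is small or noisy.
    """
    m = beta1 * m + (1 - beta1) * grad
    # Precondition by the damped diagonal curvature, then clip each
    # coordinate to [-1, 1] so low-curvature directions cannot blow up.
    step = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    theta = theta - lr * step
    return theta, m
```

The damping floor eps, the scale gamma, the clip, and the curvature-refresh schedule are exactly the kinds of added hyperparameters the cost-and-complexity claim above describes.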