SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
2026-04-09 • Machine Learning
AI summary
The authors observe that AdamW, the standard optimizer for training large language models, consumes a large amount of memory for its optimizer state. Lighter optimizers reduce this cost but still fall back to AdamW for embedding layers, whose gradients are sparse and high-variance, which gives back part of the savings. To address this, the authors propose SAGE, an optimizer that pairs a Lion-style update direction with a memory-efficient adaptive scale that keeps updates stable on embedding gradients. On Llama models, their SAGE-based hybrid achieved lower perplexity while using less optimizer-state memory than previous optimizers.
AdamW optimizer, embedding layer, gradient variance, optimizer state, memory efficiency, Lion optimizer, adaptive scaling, perplexity, large language models (LLMs), SAGE optimizer
Authors
Wooin Lee, Hyun-Tae Kim
Abstract
The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedding layer dilemma: these methods fail to handle the sparse, high-variance gradients inherent to embeddings, forcing a hybrid design that reverts to AdamW and partially negates the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this dilemma by replacing AdamW in this hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. This superior stability allows SAGE to achieve better convergence. On Llama models up to 1.3B parameters, our SAGE-based hybrid achieves new state-of-the-art perplexity, outperforming all baselines, including SinkGD hybrid, while significantly reducing optimizer state memory.
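The abstract does not give SAGE's update rule, so the following is only a rough NumPy sketch of the two ingredients it names: a Lion-style sign update and a per-dimension scale of size $O(d)$ that never exceeds 1. The function `hypothetical_damper` and its formula (column mean absolute value divided by column RMS) are assumptions made for illustration and are not the authors' scale; the helper names and default hyperparameters are likewise hypothetical.

```python
import numpy as np

def adamw_state_count(n_params):
    # AdamW stores a first and a second moment per parameter,
    # so its optimizer state is twice the model's parameter count.
    return 2 * n_params

def hypothetical_damper(g, eps=1e-8):
    """Placeholder O(d) scale for a (V, d) embedding gradient.

    NOT the SAGE formula. Returns one value per column. Because
    mean|x| <= rms(x), every entry is bounded by 1, and columns
    dominated by a few large entries get a smaller scale.
    """
    col_mean_abs = np.abs(g).mean(axis=0)      # shape (d,)
    col_rms = np.sqrt((g ** 2).mean(axis=0))   # shape (d,)
    return col_mean_abs / (col_rms + eps)

def lion_style_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0, scale=None):
    """One Lion-style step; `scale` is an optional per-column damper."""
    c = beta1 * m + (1 - beta1) * g            # interpolate momentum and gradient
    direction = np.sign(c)                     # sign-based update direction
    if scale is not None:
        direction = direction * scale          # broadcasts over the V rows
    w_new = w - lr * (direction + wd * w)      # decoupled weight decay
    m_new = beta2 * m + (1 - beta2) * g        # single momentum buffer
    return w_new, m_new

# Usage: a sparse, spiky column 0 is damped more than dense columns.
rng = np.random.default_rng(0)
g = rng.normal(size=(50, 8))
g[:, 0] = 0.0
g[3, 0] = 100.0
scale = hypothetical_damper(g)
w, m = np.zeros((50, 8)), np.zeros((50, 8))
w, m = lion_style_step(w, g, m, lr=0.1, scale=scale)
```

Because the sign direction has unit magnitude and the scale is at most 1, each coordinate moves by at most `lr` per step (when weight decay is zero), which is one way to read the "safe damper" description. The state kept between steps here is a single momentum buffer, compared with the two per-parameter buffers that `adamw_state_count` reflects for AdamW.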