OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality

2026-03-10 | Machine Learning

AI summary

The authors study the exponential moving average (EMA) that Adam uses to average past gradients. They note that existing convergence analyses of Adam-style methods have limitations: the guarantees can be suboptimal when the gradients are noise-free, assume bounded gradients or objective gaps, use constant or open-loop stepsizes, or require the Lipschitz constant in advance. To address this, they propose OptEMA, whose two variants adapt an EMA coefficient along the optimization trajectory without relying on such hard-to-know constants. Under standard assumptions (smoothness, a lower-bounded objective, and unbiased gradients with bounded variance), both variants achieve a noise-adaptive convergence rate that reduces to the nearly optimal deterministic rate when the gradient noise is zero, with no manual hyperparameter retuning.
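For readers unfamiliar with how Adam averages past gradients, the following is a minimal sketch of one step of the standard Adam update on a scalar parameter. It is not the OptEMA algorithm: the function name `adam_style_step` and the default values are illustrative assumptions, and the adaptive coefficient rule used by OptEMA is not reproduced here. The EMA coefficients `beta1` and `beta2` are plain arguments, so a caller could in principle vary them from step to step.

```python
import math

def adam_style_step(x, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step on a scalar parameter (illustrative, not OptEMA).

    x: current parameter, grad: stochastic gradient at x,
    m, v: running first- and second-order moments, t: step index starting at 1.
    Returns the updated (x, m, v).
    """
    # First-order moment: EMA of past gradients.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second-order moment: EMA of past squared gradients.
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for initializing m and v at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # The effective stepsize lr / (sqrt(v_hat) + eps) depends on observed gradients.
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

# Example: first step on f(x) = x^2 from x = 1.0, where the gradient is 2x = 2.0.
x1, m1, v1 = adam_style_step(1.0, 2.0, 0.0, 0.0, t=1)
```

On the first step the bias-corrected moments equal the gradient and its square, so the parameter moves by approximately `lr` in the direction opposite to the gradient's sign.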

Exponential Moving Average (EMA), Adam optimizer, stochastic gradient descent (SGD), convergence rate, adaptive stepsize, Lipschitz constant, gradient noise, smoothness, first-order moment, second-order moment
Authors
Ganzhao Yuan
Abstract
The Exponential Moving Average (EMA) is a cornerstone of widely used optimizers such as Adam. However, existing theoretical analyses of Adam-style methods have notable limitations: their guarantees can remain suboptimal in the zero-noise regime, rely on restrictive boundedness conditions (e.g., bounded gradients or objective gaps), use constant or open-loop stepsizes, or require prior knowledge of Lipschitz constants. To overcome these bottlenecks, we introduce OptEMA and analyze two novel variants: OptEMA-M, which applies an adaptive, decreasing EMA coefficient to the first-order moment with a fixed second-order decay, and OptEMA-V, which swaps these roles. Crucially, OptEMA is closed-loop and Lipschitz-free in the sense that its effective stepsizes are trajectory-dependent and do not require the Lipschitz constant for parameterization. Under standard stochastic gradient descent (SGD) assumptions, namely smoothness, a lower-bounded objective, and unbiased gradients with bounded variance, we establish rigorous convergence guarantees. Both variants achieve a noise-adaptive convergence rate of $\widetilde{\mathcal{O}}(T^{-1/2}+\sigma^{1/2} T^{-1/4})$ for the average gradient norm, where $\sigma$ is the noise level. In particular, in the zero-noise regime where $\sigma=0$, our bounds reduce to the nearly optimal deterministic rate $\widetilde{\mathcal{O}}(T^{-1/2})$ without manual hyperparameter retuning.
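To make the noise-adaptive rate concrete, the sketch below evaluates the two terms of the stated bound, $T^{-1/2}$ and $\sigma^{1/2} T^{-1/4}$, with constants and the logarithmic factors hidden by $\widetilde{\mathcal{O}}$ dropped. The function names `rate_terms` and `crossover_horizon` are illustrative assumptions, not notation from the paper. Setting the two terms equal gives $T = \sigma^{-2}$, so under this simplification the deterministic term is the larger one while $T < \sigma^{-2}$, and the noise term is the larger one beyond that.

```python
import math

def rate_terms(T, sigma):
    """Return (deterministic_term, noise_term) of T^{-1/2} + sigma^{1/2} T^{-1/4}.

    Constants and logarithmic factors are omitted; this is an illustration only.
    """
    deterministic_term = T ** -0.5
    noise_term = math.sqrt(sigma) * T ** -0.25
    return deterministic_term, noise_term

def crossover_horizon(sigma):
    """Horizon T at which both terms are equal: T = sigma^{-2} (infinite if sigma = 0)."""
    if sigma == 0:
        return math.inf
    return sigma ** -2

# With zero noise, only the deterministic T^{-1/2} term remains.
det, noise = rate_terms(10_000, 0.0)
```

For example, with $\sigma = 0.01$ the crossover sits at about $T = 10^4$, where both terms are approximately $0.01$.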