Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
2026-04-15 • Machine Learning
AI summary
The authors study how the common practice of using momentum in stochastic gradient descent (SGD) changes the way optimization behaves depending on batch size. They find that momentum creates two distinct stability behaviors: for small batches, it steers training toward flatter, more stable regions, while for large batches, it encourages sharper solutions, much as traditional full-batch training does. This shows that momentum affects training dynamics in a more complex way than previously thought and helps explain how to tune hyperparameters more effectively.
stochastic gradient descent • momentum • batch size • optimization stability • sharpness • mini-batch gradients • hyperparameter tuning • linear stability • curvature • full-batch dynamics
Authors
Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, Pierfrancesco Beneventano
Abstract
Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-\beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+\beta)/\eta$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
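For readers who want to probe this numerically, below is a minimal sketch of how Batch Sharpness, read here as the curvature of the mini-batch loss along its own normalized gradient direction, can be estimated with a Hessian-vector product and compared against the two plateaus $2(1-\beta)/\eta$ and $2(1+\beta)/\eta$. The helper name `batch_sharpness`, its arguments, and the use of PyTorch autograd are illustrative assumptions, not the authors' released code.

```python
import torch


def batch_sharpness(model, loss_fn, batches, n_batches=32):
    """Estimate Batch Sharpness: the average curvature of the mini-batch
    loss along its own normalized gradient direction, computed via a
    Hessian-vector product per batch.  Hypothetical helper, not the
    authors' implementation."""
    params = [p for p in model.parameters() if p.requires_grad]
    vals = []
    for i, (x, y) in enumerate(batches):
        if i >= n_batches:
            break
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        g = torch.cat([gr.reshape(-1) for gr in grads])
        v = (g / g.norm()).detach()               # unit vector along the mini-batch gradient
        hv = torch.autograd.grad(g @ v, params)   # Hessian-vector product H_B v
        hv = torch.cat([h.reshape(-1) for h in hv])
        vals.append(torch.dot(v, hv).item())      # directional curvature v^T H_B v
    return sum(vals) / len(vals)


# Stability plateaus from the abstract, for an example learning rate and momentum.
eta, beta = 0.05, 0.9
small_batch_plateau = 2 * (1 - beta) / eta   # EoSS-like plateau expected at small batch sizes
large_batch_plateau = 2 * (1 + beta) / eta   # full-batch-like plateau expected at large batch sizes
```

Tracking this estimate over training at several batch sizes, and plotting it against the two plateaus, is one way to see which regime a given choice of $\eta$, $\beta$, and batch size falls into.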