When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

2026-05-07

Machine Learning, Artificial Intelligence, Computation and Language
AI summary

The authors study sign-based optimization methods such as SignSGD and Muon, which often perform better than vanilla SGD when training large models but have so far lacked a clear theoretical explanation for that advantage. They show that by measuring progress with different mathematical quantities (the L1 norm instead of the L2 norm) and assuming noise that is sparse and affects coordinates separately, SignSGD can be proven to work more efficiently than SGD in certain settings. They extend these results to matrix-valued problems with the Muon optimizer and confirm the theory by testing it on a real GPT-2 model, where SignSGD converged faster. This work helps explain when and why sign-based methods have an advantage.
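For concreteness, the sketch below contrasts the two update rules being compared. It is a generic illustration of the textbook SGD and SignSGD steps, assuming a fixed learning rate and omitting momentum, weight decay, and any step-size schedule; it is not the exact algorithmic variants analyzed in the paper.

```python
import numpy as np

def sgd_step(x, grad, lr):
    # Vanilla SGD: the step in each coordinate scales with the gradient magnitude.
    return x - lr * grad

def signsgd_step(x, grad, lr):
    # SignSGD: only the sign of each stochastic gradient coordinate is used,
    # so every coordinate moves by exactly lr (or not at all if the gradient is 0).
    return x - lr * np.sign(grad)
```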

SignSGD, Muon optimizer, stochastic gradient descent (SGD), L1 norm, L2 norm, sparse noise, coordinate-wise updates, smoothness, stationary points, foundation models
Authors
Hongyi Tao, Dingzhi Yu, Lijun Zhang
Abstract
Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers leveraging $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, which can better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matching upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Specifically, we compare the \emph{upper bound of SignSGD} with the \emph{lower bound of SGD}, illustrating that SignSGD effectively reduces the complexity by a factor of $d$ under \emph{sparse noise}, where $d$ is the problem dimension. Furthermore, we elevate this framework to the matrix domain, providing an analogous optimal lower bound for the Muon optimizer, proving that extending the sign operator to matrices preserves this optimal scaling with dimensionality. Finally, we bridge our theoretical bounds to practice, demonstrating that the theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M-parameter GPT-2 model.
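As a rough guide to the geometry the abstract refers to, one common way to formalize $\ell_1$-norm stationarity and $\ell_\infty$-smoothness is sketched below; the paper's exact conditions (for instance, coordinate-wise smoothness constants and the details of the separable noise model) may differ.
\[
\mathbb{E}\,\|\nabla f(x)\|_1 \le \epsilon,
\qquad
\|\nabla f(x) - \nabla f(y)\|_1 \le L\,\|x - y\|_\infty,
\]
where the first condition is an $\ell_1$-norm stationarity target and the second states that the gradient is Lipschitz with respect to the dual pair of $\ell_\infty$ and $\ell_1$ norms.

Likewise, "extending the sign operator to matrices," as in Muon, is often described as mapping a gradient (or momentum) matrix to the nearest semi-orthogonal matrix, i.e., replacing its nonzero singular values with 1. The following is a minimal SVD-based illustration of that idea, not the paper's or Muon's actual implementation; practical Muon implementations approximate this map with a Newton-Schulz iteration and include momentum and scaling, which are omitted here.

```python
import numpy as np

def matrix_sign(G):
    # Matrix analogue of the sign map: write G = U diag(s) V^T and return U V^T,
    # i.e. keep the singular directions of G but set every nonzero singular value to 1.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def orthogonalized_step(W, grad, lr=1e-3):
    # Illustrative Muon-style update for a weight matrix W; real implementations
    # use momentum and a Newton-Schulz approximation instead of an exact SVD.
    return W - lr * matrix_sign(grad)
```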