Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization
2026-06-05 • Machine Learning
Machine Learning
AI summaryⓘ
The authors study a way for many computers (agents) connected in a network to work together to solve a problem when each only talks to its neighbors, without a central boss. They focus on a method that speeds up learning by mixing clever averaging of information (gossip) with advanced math techniques (Nesterov extrapolation). Their new method, called Multi-Gossip Accelerated DSGD, improves how quickly the network reaches consensus and reduces randomness in updates at the same time. They prove this leads to the fastest communication needed so far for this kind of problem, balancing the problem difficulty and network structure.
decentralized optimizationstochastic gradient descentNesterov accelerationgossip algorithmsstrong convexitycondition numberspectral gapgradient variancecommunication complexityconsensus
Authors
Ming Sun, Kun Yuan
Abstract
Decentralized stochastic optimization is a fundamental paradigm for large-scale learning over networks, where agents communicate only with their neighbors and no central coordinator is required. For strongly convex problems, communication efficiency is mainly determined by the condition number \(κ=L/μ\) and the network spectral gap \(1-β\). Although deterministic decentralized methods can simultaneously achieve accelerated \(\sqrtκ\) and \(1/\sqrt{1-β}\) dependences, no existing stochastic method attains both improvements at once. In this paper, we propose \emph{Multi-Gossip Accelerated DSGD} (MG-ADSGD), a decentralized stochastic algorithm that combines Nesterov-type primal--dual extrapolation with multi-round fast gossip averaging. The key idea is to couple the gossip depth with the mini-batch size so that additional communication rounds simultaneously improve consensus accuracy and reduce gradient variance. We show that MG-ADSGD achieves the communication complexity \[ \widetilde{\mathcal O}\!\left( \frac{σ^2}{μnε}\log\frac{1}ε + \sqrt{\fracκ{1-β}}\log\frac{1}ε \right), \] where \(ε\) denotes the target accuracy, \(n\) is the number of nodes, and \(σ^2\) is the gradient variance. To the best of our knowledge, this bound yields the best currently available communication complexity for decentralized stochastic strongly convex optimization, up to logarithmic factors that are independent of $ε$.