Edge of Stability Selectively Shapes Learning Across the Data Distribution

2026-06-02 | Machine Learning

AI summary

The authors find that the edge of stability (EoS), usually treated as a global property of optimization, also changes how learning is distributed across the training data. Some groups of examples learn faster while others slow down, depending on two conditions: whether the group's aggregate gradient aligns with the top Hessian eigenvector (the direction of sharpest curvature in the loss), and whether its gradient stays large over time. They test the alignment condition with a perturbation that preserves distance but randomizes direction, which eliminates the advantage. The work shows that EoS is not only a stability boundary but also a mechanism that governs which parts of the data are learned.

Edge of Stability, Optimization, Hessian Eigenvector, Gradient Alignment, Cross-Entropy Loss, Gradient Saturation, Machine Learning Training, Gradient Magnitude, Learning Dynamics
Authors
Shauna Kwag, Anakha Ganesh, Tomaso Poggio, Pierfrancesco Beneventano
Abstract
Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.
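The two conditions named in the abstract can be made concrete with a small toy sketch. This is not the authors' code or experimental setup. The 3x3 matrix `H`, the group gradients, and every function name below are hypothetical, chosen only for illustration. The sketch estimates the top Hessian eigenvector by power iteration, measures how well a group's aggregate gradient aligns with it, builds a same-norm random-direction vector (one possible reading of a perturbation that preserves distance but randomizes direction; the abstract does not specify how it is applied), and shows how the cross-entropy gradient with respect to the logits, softmax(z) minus the one-hot label, shrinks for confidently correct examples and stays large for outliers.

```python
import math
import random

def matvec(H, v):
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_eigenpair(H, iters=500, seed=0):
    """Power iteration. Assumes H is symmetric with a dominant positive eigenvalue."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in H]
    n = norm(v)
    v = [x / n for x in v]
    for _ in range(iters):
        w = matvec(H, v)
        n = norm(w)
        v = [x / n for x in w]
    lam = sum(a * b for a, b in zip(v, matvec(H, v)))  # Rayleigh quotient
    return lam, v

def alignment(g, v):
    """Absolute cosine between a group gradient g and a direction v."""
    return abs(sum(a * b for a, b in zip(g, v))) / (norm(g) * norm(v))

def randomize_direction(g, rng):
    """Random vector with the same norm as g (hypothetical stand-in for the perturbation)."""
    r = [rng.gauss(0.0, 1.0) for _ in g]
    scale = norm(g) / norm(r)
    return [scale * x for x in r]

def ce_logit_grad_norm(logits, label):
    """Norm of d(cross-entropy)/d(logits) = softmax(logits) - onehot(label)."""
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    total = sum(e)
    g = [x / total - (1.0 if i == label else 0.0) for i, x in enumerate(e)]
    return norm(g)

# Toy "Hessian" and two hypothetical group gradients.
H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 0.5]]
sharpness, v_top = top_eigenpair(H)
group_a = [0.85, 0.53, 0.0]   # roughly along the top eigenvector
group_b = [0.0, 0.0, 1.0]     # orthogonal to it

# For gradient descent on a quadratic, the top direction is stable only if lr * sharpness < 2.
lr_threshold = 2.0 / sharpness

print("sharpness:", sharpness, "lr threshold:", lr_threshold)
print("alignment A:", alignment(group_a, v_top))
print("alignment B:", alignment(group_b, v_top))
print("saturated grad norm:", ce_logit_grad_norm([10.0, 0.0, 0.0], 0))
print("outlier grad norm:", ce_logit_grad_norm([10.0, 0.0, 0.0], 1))
```

In this toy setting, group A would be a candidate to benefit under the first condition and group B would not. A real experiment would replace the explicit matrix with Hessian-vector products of the training loss and the hand-written gradients with per-group averages of per-example gradients.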