How abundant are good interpolators?
2026-06-04 • Machine Learning
Machine Learning
AI summaryⓘ
The authors studied a special set of linear classifiers that perfectly fit a given dataset, possibly allowing some negative margin. They looked at two common data models and found that, when the number of data points grows proportionally with the dimension but stays relatively small, most classifiers in this set behave similarly in terms of how well they generalize to new data. They also showed that some standard methods for finding classifiers usually do better than almost all others. This suggests these methods avoid most bad fits despite many classifiers fitting the training data perfectly.
linear classifiergeneralization errorlarge deviation principleGaussian mixture modellogistic modeloverparameterizationinterpolating classifierempirical risk minimizationgradient descent
Authors
August Y. Chen, Ahmed El Alaoui
Abstract
Let $S$ be the set of unit norm linear classifiers $θ\in \mathbb{R}^d$ which correctly classify every point of a labeled dataset $(X_i,y_i)_{i=1}^n$, $X_i \in \mathbb{R}^d$, $y_i \in \{-1,+1\}$, with a possibly negative margin $κ$ fixed in advance. Under two natural data-generating distributions of the $(X,y)$ pairs -- a Gaussian mixture model and a logistic model with Gaussian features -- and in the proportional regime $n/d \to α$ with small enough $α$, we establish a large deviation principle on the event that a point $θ$ chosen uniformly at random from $S$ achieves a given generalization error, with high probability over the choice of the data. The associated large deviation rate function is deterministic and describes the proportion, at the exponential scale in $d$, of interpolating classifiers having a given desired performance. As a consequence, we establish the following concentration phenomenon: all but an exponentially small fraction of interpolating classifiers have approximately the same generalization performance given by the unique maximizer of this rate function. We numerically compare this maximizer to the performance of empirical risk minimization by gradient descent and to the performance of a natural linear program, both finding a point in $S$, and deduce that in the overparametrized regime of small $α$, these efficient procedures outperform the vast majority of interpolators, pointing to their nontrivial benign overfitting in this setting.