Second-Order Path Kernel Interpolation Formulas in Machine Learning
2026-06-05 • Machine Learning
AI summary
The authors build on a 2020 formula by Pedro Domingos that expresses the prediction of a model trained by gradient descent in terms of its training data. They refine it with a second-order version that adds a curvature-weighted term to the leading path-kernel term. For noisy training methods such as stochastic gradient descent, they show that an extra term appears, coupling the curvature of the prediction with the covariance of the mini-batch gradient noise. They also show that momentum preserves the structure of the formula while modifying its weights by a memory-related factor. Finally, they give a concentration estimate for how far the final prediction fluctuates around the expected second-order formula.
neural networks, gradient descent, stochastic gradient descent, momentum, path kernel, interpolation formula, optimization path, model curvature, gradient noise, concentration estimate
Authors
Jin Guo, Roy Y. He, Jean-Michel Morel
Abstract
Understanding how training data shape neural network predictions is a central problem in modern learning theory. In 2020, Pedro Domingos proposed an interpolation formula valid for every model learned by deterministic gradient descent. It expresses the model's prediction as an integral, along the optimization path, of a data-dependent kernel that aligns the model's gradients at the test and training data. Such a first-order characterization remains valid for models trained with batch-based stochastic optimization. In this paper, we develop second-order forms of these interpolation formulas. We show that the leading path-kernel interpolation is supplemented by a curvature-weighted interpolation term. For stochastic gradient descent, an additional sampling-induced component appears, coupling the curvature of the prediction with the covariance of mini-batch gradient noise. We also extend the representation to stochastic gradient descent with momentum, where the interpolation structure is preserved but with the weights modified by a memory-related factor. Moreover, we establish a concentration estimate for the terminal prediction, identifying the fluctuation scale around the expected second-order representation. Together, these results provide a refinement of the path-kernel interpretation of neural network prediction.
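The first-order statement in the abstract can be checked numerically on a toy problem. Under gradient flow on a loss of the form sum_i l(y_i, f_w(x_i)), the chain rule gives f_T(x) = f_0(x) - \int_0^T \sum_i l'(y_i, f_t(x_i)) K_t(x, x_i) dt, with K_t(x, x') = \nabla_w f_t(x) \cdot \nabla_w f_t(x'). The sketch below is an illustration under assumptions, not the authors' construction: the two-parameter model w2 * tanh(w1 * x), the squared loss, the data values, and the function names are all hypothetical, and the curvature correction is a generic second-order Taylor term per step rather than the paper's second-order formula.

```python
import math

def f(w, x):
    """Toy model f_w(x) = w2 * tanh(w1 * x) (an assumed example model)."""
    return w[1] * math.tanh(w[0] * x)

def grad_f(w, x):
    """Gradient of f with respect to the parameters (w1, w2)."""
    t = math.tanh(w[0] * x)
    return (w[1] * x * (1.0 - t * t), t)

def hess_quad(w, x, d):
    """Quadratic form d^T H d, where H is the Hessian of f in (w1, w2)."""
    t = math.tanh(w[0] * x)
    s = 1.0 - t * t
    h11 = -2.0 * w[1] * x * x * t * s   # d^2 f / d w1^2
    h12 = x * s                         # d^2 f / d w1 d w2 (d^2 f / d w2^2 is 0)
    return h11 * d[0] * d[0] + 2.0 * h12 * d[0] * d[1]

def train_and_reconstruct(xs, ys, x_test, w0, lr, steps):
    """Run full-batch gradient descent on 0.5 * sum (f - y)^2 and rebuild
    the test prediction from path-kernel terms accumulated along the way.

    Returns (actual, first_order, second_order):
      actual       -- f at the final parameters, evaluated directly
      first_order  -- f_0(x) minus the summed kernel-weighted loss derivatives
      second_order -- first_order plus a per-step Taylor curvature term
    """
    w = list(w0)
    first = f(w, x_test)
    second = first
    for _ in range(steps):
        g_test = grad_f(w, x_test)
        step = [0.0, 0.0]
        kernel_sum = 0.0
        for x_i, y_i in zip(xs, ys):
            g_i = grad_f(w, x_i)
            dloss = f(w, x_i) - y_i                        # l'(y_i, f(x_i))
            k = g_test[0] * g_i[0] + g_test[1] * g_i[1]    # K(x_test, x_i)
            kernel_sum += dloss * k
            step[0] -= lr * dloss * g_i[0]
            step[1] -= lr * dloss * g_i[1]
        # Discrete analogue of the path integral: one Riemann-sum slice.
        first -= lr * kernel_sum
        # Same slice plus the curvature of the prediction along the step.
        second += -lr * kernel_sum + 0.5 * hess_quad(w, x_test, step)
        w[0] += step[0]
        w[1] += step[1]
    return f(w, x_test), first, second

if __name__ == "__main__":
    xs, ys = [-1.0, 0.5, 2.0], [-0.5, 0.3, 0.8]   # made-up training data
    actual, first, second = train_and_reconstruct(
        xs, ys, x_test=1.0, w0=(0.5, 0.1), lr=0.05, steps=200
    )
    print("actual prediction      :", actual)
    print("first-order gap        :", abs(actual - first))
    print("with curvature term gap:", abs(actual - second))
```

With a finite step size the first-order sum only approximates the trained prediction, because each step drops terms of second order in the parameter update. Adding the Hessian quadratic form accounts for the leading part of that gap in this toy setting, which is the kind of effect a second-order interpolation formula is meant to capture.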