Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

2026-04-29 · Machine Learning

AI summary

The authors show that the layer-by-layer updates of tokens in a transformer with MLP blocks converge to a continuous-time system of randomly driven, interacting particles. They identify the stochastic partial differential equation that governs how the distribution of tokens evolves, and they prove propagation of chaos as the number of tokens grows large. The results come with quantitative error bounds, and the limits considered commute. They also show that noise can synchronize the tokens: when the common noise is sufficiently strong relative to the self-attention drift, the interaction energy decays exponentially on average. Finally, they identify which activation functions satisfy this condition.

Transformer model, MultiLayer Perceptron (MLP), Stochastic interacting particle system, Stochastic partial differential equation (SPDE), Propagation of chaos, Self-attention, Synchronization by noise, Activation function, Continuous-time limit, Interaction energy
Authors
Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García, Samuele Saviozzi, Marco Romito
Abstract
We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. We finally characterize the activation functions satisfying this coercivity condition.
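
To make the objects named in the abstract concrete, the following is a minimal toy sketch in Python, not the authors' model. Everything specific in it is an assumption made for illustration: the softmax attention drift, the `tanh` activation, the `1/depth` and `sqrt(1/depth)` scalings, the Gaussian weight matrix redrawn at every layer and shared by all tokens (standing in for the "common noise"), and the function names `interaction_energy`, `attention_drift`, and `simulate`. It shows how a layerwise residual update resembles an Euler-Maruyama step of an interacting particle system, and how an interaction energy could be tracked along the layers.

```python
import numpy as np


def interaction_energy(x):
    """Mean squared pairwise distance: (1 / (2 N^2)) * sum_{i,j} |x_i - x_j|^2."""
    diffs = x[:, None, :] - x[None, :, :]
    return 0.5 * np.mean(np.sum(diffs**2, axis=-1))


def attention_drift(x, beta=1.0):
    """Softmax-weighted average of all tokens minus the token itself (assumed form)."""
    scores = beta * (x @ x.T)
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x - x


def simulate(x0, depth=200, horizon=1.0, sigma=1.0, beta=1.0, seed=0):
    """Run `depth` residual layers with step h = horizon / depth (hypothetical scaling)."""
    rng = np.random.default_rng(seed)
    n, d = x0.shape
    h = horizon / depth
    x = x0.copy()
    energies = [interaction_energy(x)]
    for _ in range(depth):
        # One random weight matrix per layer, shared by every token:
        # this plays the role of the common noise in this toy model.
        w = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
        x = x + h * attention_drift(x, beta) + np.sqrt(h) * sigma * np.tanh(x) @ w.T
        energies.append(interaction_energy(x))
    return x, np.array(energies)


if __name__ == "__main__":
    x0 = np.random.default_rng(1).normal(size=(16, 4))
    for sigma in (0.0, 1.0, 3.0):
        _, energies = simulate(x0, sigma=sigma)
        print(f"sigma={sigma}: initial energy {energies[0]:.3f}, final energy {energies[-1]:.3f}")
```

Because every token sees the same random matrix at each layer, a fully synchronized configuration (all tokens equal) stays synchronized in this toy, so its interaction energy remains at zero. Whether and how fast the energy decays from a generic start depends on `sigma`, `beta`, and the activation; the sketch makes no attempt to reproduce the paper's coercivity condition or its rates.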