Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

2026-05-11 • Machine Learning


AI summary

The authors study how tokens (the pieces of text processed by transformers) evolve during inference in deep encoder-only transformer models. Using tools from the analysis of interacting particle systems, they show that the token distribution quickly becomes close to the image of the initial distribution under a projection map defined by the transformer's key, query, and value matrices, and stays there for moderate times. Their theory bounds the distance between the two distributions explicitly in terms of a temperature-like parameter and the inference time. Numerical experiments confirm the prediction and show that, at finite temperature and much longer times, the token dynamics enter a different final phase governed mainly by the spectrum of the value matrix.

Transformer, Self-attention, Encoder-only transformer, Token distribution, Mean-field limit, Wasserstein distance, Lyapunov estimates, Projection map, Laplace principle, Value matrix

Authors

Albert Alcalde, Leon Bungert, Konstantin Riedl, Tim Roith

Abstract

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time, which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance between the two distributions scales like $\sqrt{\log(\beta+1)/\beta}\,\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $\beta^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\log\beta$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $\beta$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
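
To make the time scale of order $\log\beta$ concrete, the following minimal sketch evaluates the stated rate numerically. The abstract does not give the constants $C$ and $c$, so they are set to 1 here purely for illustration, and all prefactors are assumed to be 1. The function names and the choice $t=\alpha\log\beta$ with $\alpha=0.25$ are likewise hypothetical. For $\alpha<1/(2C)$ the first term behaves like $\sqrt{\log\beta}\,\beta^{\alpha C-1/2}$ and the second like $\beta^{-\alpha c}$, so both vanish as $\beta\to\infty$.

```python
import math


def wasserstein_bound(beta, t, C=1.0, c=1.0):
    """Rate from the abstract with all prefactors set to 1.

    C and c are placeholders: the abstract does not state their values.
    """
    coupling_term = math.sqrt(math.log(beta + 1.0) / beta) * math.exp(C * t)
    relaxation_term = math.exp(-c * t)
    return coupling_term + relaxation_term


def log_time(beta, alpha=0.25):
    """Hypothetical time choice t = alpha * log(beta), with alpha < 1/(2C)."""
    return alpha * math.log(beta)


# Evaluate the bound along t = alpha * log(beta) for increasing beta.
betas = [1e2, 1e4, 1e6, 1e8]
values = [wasserstein_bound(b, log_time(b)) for b in betas]

for b, v in zip(betas, values):
    print(f"beta={b:.0e}  t={log_time(b):.2f}  bound={v:.4f}")

# At fixed beta the exp(Ct) factor eventually dominates, so the bound
# stops being informative for large t.
late_time_value = wasserstein_bound(1e4, 10.0)
```

Under these assumed constants the printed values decrease as $\beta$ grows, while `late_time_value` is large. This only shows that the bound degrades at fixed $\beta$ and large $t$; it says nothing by itself about the terminal phase seen in the experiments.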
