The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

2026-05-07 · Machine Learning

Machine Learning · Artificial Intelligence
AI summary

The authors explain why, in large language models, the first few tokens often receive far more attention than the rest, a phenomenon called the attention sink. They trace it to how these models aggregate information in self-attention, which leaves some token representations with much higher variance, an effect that is then amplified by special "super neurons" in certain feed-forward layers. They test this explanation by modifying the attention mask and by boosting the variance of chosen tokens, both of which reproduce attention sinks at arbitrary positions. Finally, they propose a per-head normalization used during pre-training that keeps statistics balanced across positions and speeds up convergence.

Large Language Models · Attention Sink · Self-Attention · Value Aggregation · Feed-Forward Network · Super Neurons · Variance Discrepancy · Attention Mask · RMSNorm · Pre-training
Authors
Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu
Abstract
Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a *mechanistic explanation* for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy across positions. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity in the first-token representation, necessitating the formation of attention sinks as a structural anchor. We then validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose *head-wise RMSNorm*, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
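
To make the abstract's first claim concrete, here is a minimal numerical sketch, not taken from the paper, of how causal value aggregation alone can produce a variance discrepancy: under a causal mask the first position's output is a single value vector, while later positions average many value vectors, so their per-dimension variance shrinks roughly as 1/t. Uniform attention weights and i.i.d. unit-variance values are assumptions made purely for illustration.

```python
# Illustrative sketch (not the paper's code) of the variance discrepancy
# induced by causal value aggregation: position 1 attends only to itself,
# so its output keeps unit variance, while position t averages t value
# vectors, shrinking variance toward 1/t. Uniform weights are assumed.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 128, 64
V = rng.normal(size=(seq_len, d))  # i.i.d. value vectors, unit variance

# Causal aggregation with uniform weights: out[t] = mean(V[:t+1])
out = np.cumsum(V, axis=0) / np.arange(1, seq_len + 1)[:, None]

var_per_pos = out.var(axis=1)  # variance across the d dimensions
print(f"var at position 1:   {var_per_pos[0]:.3f}")   # ~1.0
print(f"var at position 128: {var_per_pos[-1]:.3f}")  # ~1/128
```

With softmax weights rather than uniform ones the 1/t decay is less exact, but the qualitative gap between the first position and later ones, which the abstract identifies as the seed of sink formation, remains.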
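The abstract names head-wise RMSNorm but does not spell out its placement here; the sketch below is one plausible reading, assuming an RMSNorm applied independently to each head's value-aggregation output before the output projection, with a per-head learnable gain. The class name `HeadwiseRMSNorm` and the exact tensor layout are illustrative assumptions, not the authors' implementation.

```python
# A hedged sketch of head-wise RMSNorm: normalize each attention head's
# value-aggregation output so every position has comparable scale,
# restoring the statistical parity the abstract describes.
import torch
import torch.nn as nn

class HeadwiseRMSNorm(nn.Module):
    def __init__(self, n_heads: int, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # One gain vector per head (assumed; the paper may share or omit it).
        self.weight = nn.Parameter(torch.ones(n_heads, head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_heads, seq_len, head_dim) — per-head attention outputs.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight[None, :, None, :]

# Schematic placement inside an attention block (assumed, not confirmed):
#   attn_out = torch.softmax(scores, dim=-1) @ v   # value aggregation
#   attn_out = headwise_rmsnorm(attn_out)          # equalize across positions
#   y = out_proj(attn_out.transpose(1, 2).reshape(B, T, D))
```

Normalizing per head, rather than over the full concatenated hidden state, keeps the rescaling local to each head's aggregation output, which is where the abstract locates the variance discrepancy.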