Vision Transformers Need Better Token Interaction

2026-05-22 • Computer Vision and Pattern Recognition


AI summary

The authors examine why patch representations in Vision Transformers (ViTs) become less effective for dense prediction over prolonged training, even as image-level representations stay strong. They argue that high-norm artifacts alone do not fully explain the degradation: global semantic information also spreads through patch tokens beyond what is locally justified, which they call semantic diffusion. Rather than removing global context, they make token interactions more selective by replacing softmax attention with entmax-1.5, a sparse attention mechanism that keeps global token connectivity. On DINOv1 ViT-S/16, this change preserves ImageNet linear probing accuracy and improves semantic segmentation on VOC, ADE20K, and Cityscapes.

Vision Transformers, Dense prediction, Semantic segmentation, Semantic diffusion, Sparse attention, Entmax, ImageNet, Patch tokens, Linear probing, Global context

Authors

Linxiang Su

Abstract

Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize semantic diffusion: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and [CLS] features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax-1.5 while preserving global token connectivity. On DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.
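
To make the softmax-to-entmax-1.5 swap concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It relies on the standard definition of 1.5-entmax, p_i = max(s_i/2 - tau, 0)^2 with tau chosen so the weights sum to 1, and finds tau by bisection rather than the exact sort-based algorithm. The names `entmax15`, `softmax`, and `attention` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Dense baseline: every token receives a strictly positive weight."""
    s = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def entmax15(scores, axis=-1, n_iter=50):
    """1.5-entmax via bisection (illustrative, not the exact sort-based solver).

    Solves for tau such that sum_i max(s_i / 2 - tau, 0)^2 = 1.
    """
    s = np.asarray(scores, dtype=np.float64) / 2.0
    s = s - s.max(axis=axis, keepdims=True)  # shift so the row maximum is 0
    n = s.shape[axis]
    # After the shift, the solution tau lies in [-1, -1/sqrt(n)].
    lo = np.full(s.max(axis=axis, keepdims=True).shape, -1.0)
    hi = np.full_like(lo, -1.0 / np.sqrt(n))
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        total = (np.clip(s - tau, 0.0, None) ** 2).sum(axis=axis, keepdims=True)
        too_much = total >= 1.0  # mass too large: the threshold must increase
        lo = np.where(too_much, tau, lo)
        hi = np.where(too_much, hi, tau)
    p = np.clip(s - (lo + hi) / 2.0, 0.0, None) ** 2
    return p / p.sum(axis=axis, keepdims=True)  # remove residual bisection error

def attention(q, k, v, normalizer=softmax):
    """Single-head scaled dot-product attention with a pluggable normalizer."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = normalizer(scores, axis=-1)
    return weights @ v, weights

# Toy comparison on random "patch tokens" (shapes are arbitrary assumptions).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) * 2.0 for _ in range(3))
_, w_soft = attention(q, k, v, normalizer=softmax)
_, w_sparse = attention(q, k, v, normalizer=entmax15)
print("softmax zero weights:", int((w_soft == 0).sum()))
print("entmax-1.5 zero weights:", int((w_sparse == 0).sum()))
```

Every query can still attend to any key, so global connectivity is preserved, but keys whose scores fall below the threshold receive exactly zero weight. For example, `entmax15(np.array([4.0, 0.0, -4.0]))` puts all mass on the first entry, whereas softmax assigns a small positive weight to each entry.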
