AI summary
The authors show that Vision Transformers (ViTs), which normally compare every part of an image with every other part (a process that becomes very slow at high resolutions), can still work well without any direct interaction between patches. They introduce VECA, a model that uses a small set of learned 'core' tokens through which different parts of the image communicate, so that computation grows roughly linearly with image size instead of quadratically. This approach keeps all the image patches updated but lets them interact only through these cores, making the process faster and more efficient. Their experiments show that VECA performs on par with recent top models while using less compute, suggesting an alternative way to build ViTs that scales better to large images.
Keywords
Vision Transformers, Self-attention, Computational complexity, Patch tokens, Core-periphery attention, Cross-attention, Linear scaling, Model efficiency, Visual-semantic representations, Image classification
Authors
Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai, Muquan Yu, Hossein Adeli, Deva Ramanan, Michael J. Tarr, Andrew F. Luo
Abstract
Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches interact directly only with a resolution-invariant set of $C$ learned "core" embeddings, the attention cost is $O(NC)$, which is linear in $N$ for a predetermined $C$ and bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy at inference time. Across classification and dense prediction tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
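To make the attention pattern concrete, below is a minimal PyTorch sketch of one possible core-periphery block under the constraints stated in the abstract: no patch-to-patch attention, and all $N$ patch tokens retained and updated. The class name `CorePeripheryBlock`, the two-step cross-attention ordering (cores read from patches, then patches read from cores), the normalization/MLP placement, and the example values (768-dim tokens, $C = 64$ cores, a 1024x1024 image) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of a core-periphery attention block.
# Assumptions (not from the paper): two cross-attention steps per block,
# pre-norm on the query side only, and a standard per-token MLP.
import torch
import torch.nn as nn


class CorePeripheryBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cores gather information from all N patch tokens: cost O(N * C).
        self.core_from_patch = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Patches are updated by attending to the C cores only: cost O(N * C).
        self.patch_from_core = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_core = nn.LayerNorm(dim)
        self.norm_patch = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patches: torch.Tensor, cores: torch.Tensor):
        # patches: (B, N, dim); cores: (B, C, dim). C is fixed, so the block is linear in N.
        delta, _ = self.core_from_patch(self.norm_core(cores), patches, patches)
        cores = cores + delta                      # cores aggregate global context
        delta, _ = self.patch_from_core(self.norm_patch(patches), cores, cores)
        patches = patches + delta                  # patches communicate only via the cores
        patches = patches + self.mlp(patches)      # all N patch tokens stay live and updated
        return patches, cores


# Example usage: 4096 patch tokens (a 1024x1024 image with 16x16 patches) and a
# hypothetical C = 64 cores give N*C = 262,144 attention interactions per step,
# versus N^2 = 16,777,216 for full all-to-all self-attention.
block = CorePeripheryBlock(dim=768)
patches = torch.randn(2, 4096, 768)
cores = torch.randn(1, 64, 768).expand(2, -1, -1)  # learned cores, shared across the batch
patches, cores = block(patches, cores)
```

In such a sketch, the $C$ core embeddings would be a learned parameter initialized from scratch and carried across layers, and the elastic compute/accuracy trade-off described in the abstract would correspond to keeping only the first $c \le C$ cores at inference, mirroring the nested training along the core axis.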