Tango: Taming Visual Signals for Efficient Video Large Language Models

2026-04-10

Computer Vision and Pattern Recognition
AI summary

The authors studied ways to make Video Large Language Models (Video LLMs) faster by cutting down the number of video tokens they analyze. They found that existing methods for choosing which tokens to keep either miss important patterns or create confusing groups of tokens. To fix this, they created a method called Tango that picks tokens more smartly and keeps track of where things are in the video. Their tests show Tango keeps 98.9% of the original model’s accuracy while making it run 1.88x faster when using only 10% of the tokens.
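The selection problem can be illustrated with a toy sketch. This is not Tango's actual algorithm (the abstract does not give it); the function names, the greedy penalty rule, and all parameters below are illustrative assumptions. It shows why plain top-k can let a single attention peak absorb the whole token budget, while a diversity-aware variant also covers a weaker secondary region:

```python
import numpy as np

def topk_select(scores, k):
    """Plain top-k: keep the k tokens with the highest attention scores."""
    return np.argsort(scores)[::-1][:k]

def diverse_select(scores, coords, k, penalty=0.5):
    """Greedy diversity-aware selection (illustrative, not the paper's
    method): repeatedly pick the best-scoring token, then down-weight
    remaining tokens near it so one peak cannot take the whole budget."""
    scores = scores.astype(float)
    chosen = []
    for _ in range(k):
        i = int(np.argmax(scores))
        chosen.append(i)
        scores[i] = -np.inf
        # Penalize spatial neighbours of the chosen token.
        dist = np.linalg.norm(coords - coords[i], axis=1)
        scores -= penalty * np.exp(-dist)
    return np.array(chosen)

# Toy 1x16 "frame": a broad attention peak at position 4
# plus a weaker secondary mode at position 12.
pos = np.arange(16)
coords = np.stack([np.zeros(16), pos], axis=1)
scores = np.exp(-0.1 * (pos - 4) ** 2) + 0.5 * np.exp(-0.5 * (pos - 12) ** 2)

plain = topk_select(scores, 4)        # all four picks cluster around 4
diverse = diverse_select(scores, coords, 4)  # also covers the mode at 12
```

On this toy input, top-k spends the entire budget on positions 2–6, while the diversity-aware variant trades one of those near-duplicate picks for the secondary mode at position 12.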

Video Large Language Models, token pruning, attention-based selection, similarity-based clustering, top-k selection, spatial distribution, Spatio-temporal Rotary Position Embedding, token clustering, video understanding benchmarks
Authors
Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen
Abstract
Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.
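The ST-RoPE idea can be sketched as a multi-axis rotary position embedding. The channel split and function names below are assumptions for illustration, not Tango's exact formulation; the sketch demonstrates the key RoPE property that attention dot products depend only on relative (t, h, w) offsets, which is how such an encoding injects a spatio-temporal locality prior:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard 1-D RoPE: rotate channel pairs of x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequencies
    ang = float(pos) * theta                   # rotation angles
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

def st_rope(x, t, h, w):
    """Illustrative spatio-temporal RoPE (an assumption, not Tango's exact
    scheme): split the channels into three equal groups and rotate each
    group by one coordinate of the token's (t, h, w) grid position."""
    d = x.shape[-1] // 3
    return np.concatenate([
        rope_rotate(x[..., :d], t),       # temporal axis
        rope_rotate(x[..., d:2 * d], h),  # vertical axis
        rope_rotate(x[..., 2 * d:], w),   # horizontal axis
    ], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=12), rng.normal(size=12)  # 4 channels per axis
# Both pairs below are separated by the same relative offset (2, 1, 3),
# so their dot products match even though absolute positions differ:
a = st_rope(q, 1, 2, 0) @ st_rope(k, 3, 3, 3)
b = st_rope(q, 5, 6, 4) @ st_rope(k, 7, 7, 7)
```

Because each rotation is orthogonal, token norms are preserved and only relative positions influence attention scores, so pruned-and-reordered tokens can retain their original geometric relationships.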