Linear Scaling Video VLMs for Long Video Understanding

2026-05-29

Computer Vision and Pattern Recognition
AI summary

The authors developed StateKV, an inference-time method that lets video vision-language models process long videos more efficiently. With standard spatiotemporal self-attention, compute and latency grow quadratically with the number of frames; StateKV instead carries cross-frame context in a fixed-capacity, importance-based recurrent state and keeps a separate full per-frame cache for decoding. Across three long-video benchmarks and seven models, accuracy stays close to full self-attention at a lower video-prefill cost in FLOPs. The method requires no fine-tuning or architectural changes.

video vision-language models, spatiotemporal self-attention, long-horizon video understanding, inference-time optimization, recurrent state, cross-frame context, streaming video processing, compute efficiency, sliding-window approximation
Authors
Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles
Abstract
Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
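The scaling argument in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `(score, token_id)` token representation, and the top-k eviction rule are all assumptions made for illustration, and the abstract does not say how importance is actually scored. The sketch only counts query-key pairs to show why a bounded state makes prefill cost grow linearly with the number of frames, and shows a bounded state kept alongside a full cache.

```python
import heapq


def full_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs when every video token attends to every video token."""
    n = num_frames * tokens_per_frame
    return n * n


def fixed_state_pairs(num_frames, tokens_per_frame, state_capacity):
    """Query-key pairs when each frame attends to itself plus a bounded state."""
    total = 0
    state_size = 0
    for _ in range(num_frames):
        # Each token of the current frame sees the state and its own frame.
        total += tokens_per_frame * (state_size + tokens_per_frame)
        # The state never grows past its capacity, so per-frame cost is bounded.
        state_size = min(state_capacity, state_size + tokens_per_frame)
    return total


def update_state(state, frame_tokens, capacity):
    """Keep the `capacity` highest-scoring (score, token_id) entries.

    Hypothetical rule: the score is a stand-in for an importance measure.
    """
    return heapq.nlargest(capacity, state + frame_tokens)


def stream_prefill(frames, capacity):
    """Process frames in order, keeping a bounded state and a full cache."""
    state = []
    decode_cache = []
    for frame in frames:
        decode_cache.extend(frame)  # full per-frame cache, kept for decoding
        state = update_state(state, frame, capacity)
    return state, decode_cache
```

With 2 tokens per frame and a state capacity of 2, doubling the video from 4 to 8 frames raises the full-attention count from 64 to 256 pairs, while the bounded-state count goes from 28 to 60, which is roughly proportional to the frame count.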