Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

2026-02-20 · Computer Vision and Pattern Recognition

AI summary

The authors focus on improving how computers understand and answer questions about long video streams. They found that existing methods lose important visual details because they use too few tokens (small pieces of information) per video frame. To fix this, they increased the number of tokens to better capture details and designed smart ways to pick useful parts of the video while avoiding too much repetition. Their new approach, called MemStream, improves accuracy on several video question-answering tests.

streaming video understanding · video question answering · token budget · key-value caching · spatiotemporal information · query-frame similarity · adaptive selection · retrieval mixture-of-experts · MemStream · Qwen2.5-VL-7B
Authors
Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava
Abstract
Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
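The adaptive selection idea described above can be illustrated with a minimal sketch: keep a token only if it is not too similar to tokens already kept, so near-duplicate tokens within a frame are pruned while distinct spatial content survives. The function name, the greedy strategy, and the cosine threshold below are illustrative assumptions, not the paper's actual algorithm.

```python
import math

def cosine(a, b):
    # Cosine similarity between two token embeddings (lists of floats).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_tokens(tokens, budget, redundancy_thresh=0.95):
    """Greedy redundancy-aware token selection (hypothetical sketch).

    A token is kept only if its similarity to every already-kept token
    stays below the threshold, up to a fixed per-frame budget. MemStream's
    adaptive strategy is more involved; this just shows the pruning idea.
    """
    kept = []
    for tok in tokens:
        if len(kept) >= budget:
            break
        if all(cosine(tok, k) < redundancy_thresh for k in kept):
            kept.append(tok)
    return kept

# Two near-duplicate pairs collapse to one representative each.
frame_tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]]
print(len(select_tokens(frame_tokens, budget=4)))  # 2
```

In a streaming setting this pruning would run per frame before the surviving tokens are written into the KV cache, keeping the memory footprint bounded as the stream grows.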