A Simple Baseline for Streaming Video Understanding
2026-04-02 • Computer Vision and Pattern Recognition
AI summary
The authors show that a very simple method, which only looks at the most recent few video frames, works as well as or better than many complicated models designed for understanding long video streams. They call this method SimpleStream and tested it against many other models on two large benchmarks, where it performed strongly despite its simplicity. They also found that having more memory of past frames isn't always better; its value depends on the type of video model being used. The authors suggest that future tests should clearly separate tasks that need short-term understanding from those needing long-term memory, so that real improvements can be measured more accurately.
streaming video understanding, sliding-window, video language models (VLM), SimpleStream, memory mechanisms, short-term context, long-term memory, video benchmarks, real-time perception, recall
Authors
Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu
Abstract
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
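The sliding-window baseline described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the `vlm_answer` callback stands in for a query to an off-the-shelf VLM (a hypothetical name), and frames are represented abstractly.

```python
from collections import deque

def simple_stream(frames, n=4):
    """Sliding-window baseline: after each incoming frame, keep only the
    most recent n frames and yield that window as the model's context.

    In the paper's setting, the window would be passed to an off-the-shelf
    VLM (e.g. a hypothetical vlm_answer(window, question) call) whenever a
    query arrives; here we just yield the window to show the mechanism.
    """
    window = deque(maxlen=n)  # older frames fall out automatically
    for frame in frames:
        window.append(frame)
        yield list(window)
```

With `n=4`, the context handed to the model is never more than the four most recent frames, regardless of how long the stream runs; this is the entire "memory mechanism" that the paper shows is competitive with far more elaborate designs.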