VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
2026-03-23 • Computer Vision and Pattern Recognition
AI summary
The authors address the problem that current multimodal large language models struggle to understand long videos because limited context windows let them observe only a few short segments at a time. They created VideoDetective, a system that considers not just the question but also how video segments relate to one another visually and temporally. By estimating which observed segments are relevant and propagating this relevance to unseen segments, their method localizes the key video sections needed to answer questions. Tests show their approach improves accuracy on long video question answering tasks.
multimodal large language models, long video understanding, context window, video segmentation, visual-temporal affinity graph, query-to-segment relevance, hypothesis-verification-refinement loop, video question answering, VideoMME-long benchmark
Authors
Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
Abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and the varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into multiple segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate the relevance of observed segments to the query and propagate these scores to unseen segments, yielding a global relevance distribution that guides localization of the most critical segments for final answering under sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
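The core mechanism the abstract describes (building an affinity graph over video segments from visual similarity and temporal proximity, then propagating relevance from observed segments to unseen ones) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual formulation: the function names, the convex mixing of visual and temporal affinity, the Gaussian temporal kernel, and the label-propagation update are all assumptions introduced for exposition.

```python
import numpy as np

def affinity_graph(seg_embs, timestamps, sigma_t=2.0, alpha=0.5):
    """Illustrative visual-temporal affinity: a convex combination of
    (clipped) cosine similarity between segment embeddings and a
    Gaussian kernel over temporal distance. The paper's exact weighting
    may differ."""
    E = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    visual = np.clip(E @ E.T, 0.0, None)                # non-negative cosine
    dt = np.abs(timestamps[:, None] - timestamps[None, :])
    temporal = np.exp(-(dt ** 2) / (2 * sigma_t ** 2))  # temporal proximity
    return alpha * visual + (1 - alpha) * temporal

def propagate_relevance(A, observed, n_iters=20, lam=0.5):
    """Spread relevance scores of observed segments to unseen ones via
    iterative diffusion on the row-normalized graph (a standard
    label-propagation scheme, used here as a stand-in for the paper's
    propagation step)."""
    W = A / A.sum(axis=1, keepdims=True)                # row-stochastic
    r0 = np.zeros(A.shape[0])
    for idx, score in observed.items():                 # anchor observed scores
        r0[idx] = score
    r = r0.copy()
    for _ in range(n_iters):
        r = lam * (W @ r) + (1 - lam) * r0              # diffuse, keep anchors
    return r
```

The resulting vector gives a global relevance distribution; picking its top entries (e.g. `np.argsort(r)[::-1]`) would correspond to selecting the most critical segments to observe next or to feed the MLLM for final answering.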