TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

2026-06-18 · Computer Vision and Pattern Recognition

AI summary

The authors developed TimeProVe, a method for answering questions about hours-long videos. It first uses lightweight modules to propose candidate answers with supporting evidence windows, then calls an expensive vision-language model (VLM) only to verify those targeted segments, rather than analyzing the whole video densely. They also created OpenTSUBench (OTB), an open-ended benchmark that tests how well models find evidence in videos of daily activities. On OTB, TimeProVe outperforms the strongest baseline by 7.3% while reducing VLM calls by 75% and inference cost by 93%. Without explicit temporal grounding training, it is also competitive on Charades-STA, and it reaches state-of-the-art results when combined with grounding VLMs.

Long Video Question Answering, Vision-Language Models, Temporal Grounding, Action Recognition, Large Language Models, Benchmark, Activities of Daily Living, Sparse Reasoning, Video Inference Cost, Charades-STA
Authors
Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das
Abstract
Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer-evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.
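
The propose-then-verify control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Hypothesis`, `propose_then_verify`, `toy_propose`, and `toy_verify`, the `max_vlm_calls` budget, and the keyword-matching proposer are all hypothetical stand-ins chosen for illustration. In the paper, proposals come from the ACE module via lightweight LLM reasoning, and verification is done by a VLM; here both are replaced by simple stubs so the example runs on its own.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A temporally localized action: (label, start_seconds, end_seconds).
Action = Tuple[str, float, float]


@dataclass
class Hypothesis:
    """A candidate answer paired with its supporting evidence window."""
    answer: str
    window: Tuple[float, float]  # (start_seconds, end_seconds)


def propose_then_verify(
    query: str,
    actions: List[Action],
    propose: Callable[[str, List[Action]], List[Hypothesis]],
    verify: Callable[[str, Hypothesis], bool],
    max_vlm_calls: int = 3,  # hypothetical budget, not from the paper
) -> Tuple[Optional[Hypothesis], int]:
    """Cheaply propose hypotheses, then spend expensive calls only on them.

    Returns the first verified hypothesis (or None) and the number of
    expensive verification calls that were made.
    """
    hypotheses = propose(query, actions)  # cheap step
    calls = 0
    for hyp in hypotheses[:max_vlm_calls]:
        calls += 1  # each verify() stands in for one VLM call
        if verify(query, hyp):
            return hyp, calls
    return None, calls


def toy_propose(query: str, actions: List[Action]) -> List[Hypothesis]:
    """Assumed stand-in for the proposer: keep actions sharing a query word."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    return [
        Hypothesis(answer=label, window=(start, end))
        for label, start, end in actions
        if query_words & set(label.lower().split())
    ]


def toy_verify(query: str, hyp: Hypothesis) -> bool:
    """Assumed stand-in for the VLM: confirms one hard-coded answer."""
    return hyp.answer == "drink coffee"


# Toy action timeline for a long video (times in seconds).
actions = [
    ("drink water", 10.0, 14.0),
    ("cook pasta", 120.0, 300.0),
    ("drink coffee", 4000.0, 4020.0),
]
best, calls = propose_then_verify(
    "What did the person drink?", actions, toy_propose, toy_verify
)
print(best, calls)
```

In this toy run, only the two windows whose labels share a word with the query are passed to the verifier, so the expensive step is called twice instead of once per segment of the video. That is the intuition behind the cost reduction the abstract reports, though the actual savings depend on the real proposer and verifier.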