CAST: Modeling Visual State Transitions for Consistent Video Retrieval

2026-03-09
Computer Vision and Pattern Recognition

AI summary

The authors observe that current video-clip retrieval methods prioritize local semantic content while ignoring whether characters and scene state remain consistent over time. They formalize this problem as a new task, Consistent Video Retrieval (CVR), and build a diagnostic benchmark from cooking and instructional-task videos. Their method, CAST, adds a small module that conditions on the video's visual history to predict how the latent scene state evolves. Experiments show that CAST improves retrieval performance across multiple datasets and, when used to rerank video generation candidates, produces more temporally coherent video sequences.

video retrieval, context-aware, state consistency, vision-language models, benchmark, long-form video, video generation, embedding spaces, reranking, inductive bias
Authors
Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao
Abstract
As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
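The core idea in the abstract — a lightweight adapter that predicts a residual update $\Delta$ from visual history and applies it in a frozen embedding space before similarity-based retrieval — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `CastAdapter` class, the mean-pooling of history, and the untrained linear map are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of a CAST-style context-aware retrieval step.
# Assumptions (not from the paper): a single untrained linear layer as the
# adapter, mean pooling over history, and cosine similarity for scoring.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class CastAdapter:
    """Toy adapter: pools frozen history embeddings and maps them to a
    residual state update (delta) in the same embedding space."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Small-scale init so the predicted delta starts near zero.
        self.W = rng.normal(scale=0.02, size=(dim, dim))

    def predict_delta(self, history):        # history: (T, dim)
        pooled = history.mean(axis=0)        # simple temporal pooling
        return pooled @ self.W               # residual update in embedding space

def retrieve(history, candidates, adapter):
    """Rank candidate clip embeddings against a state-updated query:
    query = last clip embedding + predicted delta from the visual history."""
    query = l2_normalize(history[-1] + adapter.predict_delta(history))
    scores = l2_normalize(candidates) @ query  # cosine similarities
    return np.argsort(-scores), scores

dim = 16
rng = np.random.default_rng(1)
history = rng.normal(size=(4, dim))      # embeddings of clips seen so far
candidates = rng.normal(size=(5, dim))   # retrieval pool (or generated candidates)
order, scores = retrieve(history, candidates, CastAdapter(dim))
```

The same scoring function doubles as the reranking signal mentioned for black-box generation: embed each generated continuation, score it against the state-updated query, and keep the top-ranked candidate.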