AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

2026-03-24 • Computer Vision and Pattern Recognition

AI summary

The authors address Referring Video Object Segmentation (RVOS), the task of segmenting a specific object throughout a video based on a natural language description. Existing training-free pipelines ask a multimodal language model to choose which frames to examine before any object-level evidence is available. Their method instead uses a perception model (SAM3) to first extract object tracks across the whole video; the language model then reasons over these tracks to determine which one matches the query. Their approach, AgentRVOS, requires no additional training and outperforms other training-free methods on multiple benchmarks.

Referring Video Object Segmentation, Multimodal Large Language Model, SAM3, Mask Tracks, Video Segmentation, Training-free Methods, Natural Language Query, Temporal Reasoning, Object-level Evidence

Authors

Woojeong Jin, Jaeho Lee, Heeseong Shin, Seungho Jang, Junhwan Heo, Seungryong Kim

Abstract

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes and grounds the referred object within those frames, and a video segmentation model then propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning candidates under the guidance of SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.
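
The perceive-then-reason loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `MaskTrack` type, the `select_target` function, the `score_fn` callback (a stand-in for an MLLM call), and the `keep_ratio` and `max_rounds` parameters are all hypothetical names introduced here, and the pruning rule shown (keep the top-scoring fraction each round) is only one plausible reading of "iteratively pruning". The sketch assumes mask tracks have already been produced by a perception model and that each track records the frames in which its object exists.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class MaskTrack:
    """Hypothetical container for one object track from a perception model."""
    track_id: int
    frames: Set[int]  # frame indices where the object exists


# Hypothetical scorer: (query, track, frames to inspect) -> relevance score.
# In a real system this would wrap an MLLM call; here it is just a callback.
ScoreFn = Callable[[str, MaskTrack, List[int]], float]


def select_target(
    tracks: List[MaskTrack],
    query: str,
    score_fn: ScoreFn,
    keep_ratio: float = 0.5,
    max_rounds: int = 5,
) -> MaskTrack:
    """Iteratively prune candidate tracks until one remains (assumed rule)."""
    if not tracks:
        raise ValueError("at least one mask track is required")

    candidates = list(tracks)
    for _ in range(max_rounds):
        if len(candidates) <= 1:
            break
        # Use temporal existence to restrict attention to frames in which
        # at least one remaining candidate is actually present.
        frames = sorted(set().union(*(t.frames for t in candidates)))
        scores = {t.track_id: score_fn(query, t, frames) for t in candidates}
        candidates.sort(key=lambda t: scores[t.track_id], reverse=True)
        # Keep the best-scoring fraction, but never drop below one track.
        keep = max(1, int(len(candidates) * keep_ratio))
        candidates = candidates[:keep]

    # If rounds run out, fall back to the current top-ranked candidate.
    return candidates[0]
```

In this sketch the key design point mirrors the abstract: the scorer only ever sees object-level evidence (tracks and the frames where they exist), so no temporal decision is made before perception has run over the full video.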
