InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

2026-04-09, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors found that current models for understanding images and videos with text usually focus on the whole scene but miss details about individual objects. They created InstAP, a new training method that teaches the model to connect specific words to exact objects or regions in images and videos. To do this, they made a big dataset called InstVL with both overall scene descriptions and detailed labels for individual objects. Their approach helps the model better find and describe specific objects, and also improves general understanding of videos without extra training. Visual tests show their method is better at pointing out the right object compared to older models.

vision-language pre-training, instance-level reasoning, contrastive alignment, spatial-temporal grounding, dataset, instance retrieval, zero-shot learning, video benchmarks, attention mechanisms, global vs. local features
Authors
Ashutosh Kumar, Rajat Saini, Jingjing Pan, Mustafa Erdogan, Mingfang Zhang, Betty Le Dem, Norimasa Kobori, Quan Kong
Abstract
Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
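The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything here is an assumption: a symmetric InfoNCE term is used for both granularities, and the names `info_nce`, `joint_loss`, `temperature`, and `instance_weight` are hypothetical, not taken from the paper. Embeddings are plain Python lists, with row `i` of the visual batch matched to row `i` of the text batch.

```python
import math

def _normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (visual[i], text[i]) pairs.

    The temperature value is an illustrative assumption.
    """
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in text]
    # Cosine similarity matrix, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))

def joint_loss(global_vis, global_txt, inst_vis, inst_txt, instance_weight=1.0):
    """Global scene-caption term plus a weighted instance-description term.

    global_*: embeddings of whole images/videos and their scene captions.
    inst_*:   embeddings of grounded regions and their instance descriptions.
    The weighted-sum combination is an assumption, not the paper's stated form.
    """
    return (info_nce(global_vis, global_txt)
            + instance_weight * info_nce(inst_vis, inst_txt))
```

In this sketch, setting `instance_weight=0.0` reduces the objective to the global-only supervision that the abstract contrasts against, which is one way to picture the baseline trained on the same data corpus.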