Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
2026-02-26 • Computer Vision and Pattern Recognition
AI summary
The authors study how to improve the ability of computers to identify and separate objects in images based on text descriptions, even for categories they have not seen before. They point out that current methods struggle because the training labels are too coarse and language descriptions can be ambiguous. To fix this, they add a few example images with detailed pixel-level labels to help the system at test time. Their method fuses information from the text and these example images adaptively for each image, which helps the system recognize and segment objects more accurately, even for new or fine-grained tasks.
open-vocabulary segmentation, vision-language models, zero-shot recognition, few-shot learning, pixel-level annotation, text prompts, test-time adaptation, multimodal fusion, personalized segmentation, support set
Authors
Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
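The idea of combining textual class embeddings with visual prototypes from a pixel-annotated support set can be illustrated with a minimal sketch. This is not the authors' method (which learns a per-query fusion via a lightweight test-time adapter); it shows only the simpler hand-crafted late-fusion baseline the abstract contrasts against, with an assumed fixed mixing weight `alpha` and cosine-similarity classification. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize rows to unit length (small epsilon avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def late_fusion_segment(query_feats, text_protos, visual_protos, alpha=0.5):
    """Hand-crafted late fusion (illustrative baseline, not the paper's learned fusion).

    query_feats:   (P, D) per-pixel features of the query image
    text_protos:   (C, D) one text embedding per class (e.g. from a VLM text encoder)
    visual_protos: (C, D) one visual prototype per class, e.g. masked average
                   pooling of support-image features under the pixel annotations
    alpha:         fixed text/visual mixing weight (assumed; the paper instead
                   learns the fusion per query image)
    Returns (P,) predicted class index per pixel.
    """
    # Mix the two modalities per class, then renormalize the fused prototype.
    fused = l2norm(alpha * l2norm(text_protos)
                   + (1.0 - alpha) * l2norm(visual_protos))
    # Cosine similarity between each pixel feature and each fused prototype.
    sims = l2norm(query_feats) @ fused.T          # (P, C)
    return sims.argmax(axis=1)

# Toy example: 2 classes, 4-dim features, 2 query pixels.
text_protos = np.array([[1.0, 0.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0, 0.0]])
visual_protos = np.array([[0.9, 0.1, 0.0, 0.0],
                          [0.1, 0.9, 0.0, 0.0]])
query_feats = np.array([[0.8, 0.2, 0.0, 0.0],   # closer to class 0
                        [0.1, 0.7, 0.0, 0.0]])  # closer to class 1
labels = late_fusion_segment(query_feats, text_protos, visual_protos)
```

The paper's retrieval-augmented adapter replaces the fixed `alpha` with fusion weights learned per query image, which is where the claimed cross-modal synergy comes from.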