IntRec: Intent-based Retrieval with Contrastive Refinement

2026-02-19

Computer Vision and Pattern Recognition
AI summary

The authors developed IntRec, a system that retrieves user-requested objects from cluttered scenes by learning from user feedback. It keeps track of what the user has confirmed and rejected to refine its predictions, which helps it distinguish confusing or similar objects without needing extra training data. Their experiments show that IntRec outperforms existing methods, with the largest gains coming after a single round of corrective feedback.

open-vocabulary object detection, interactive retrieval, user feedback, contrastive alignment, positive anchors, negative constraints, LVIS dataset, object disambiguation, AP (average precision)
Authors
Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that incrementally refines its predictions through user interaction. At its core is an Intent State (IS) that maintains dual memory sets: positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
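
To make the dual-memory idea concrete, below is a minimal Python/NumPy sketch of an intent state with a contrastive scoring rule. It is an illustration of the mechanism described in the abstract, not the authors' implementation: the class name IntentState, the use of cosine similarity over L2-normalized embeddings, the max-aggregation over each memory set, and the penalty weight are all assumptions made for clarity.

```python
import numpy as np

class IntentState:
    """Dual-memory intent state (illustrative sketch, not the paper's code):
    positive anchors = embeddings of user-confirmed cues,
    negative constraints = embeddings of user-rejected hypotheses."""

    def __init__(self, penalty: float = 0.5):
        self.positives: list[np.ndarray] = []  # confirmed cues
        self.negatives: list[np.ndarray] = []  # rejected hypotheses
        self.penalty = penalty                 # weight on the negative term (assumed)

    @staticmethod
    def _normalize(v: np.ndarray) -> np.ndarray:
        return v / (np.linalg.norm(v) + 1e-8)

    def confirm(self, emb: np.ndarray) -> None:
        self.positives.append(self._normalize(emb))

    def reject(self, emb: np.ndarray) -> None:
        self.negatives.append(self._normalize(emb))

    def score(self, candidate: np.ndarray) -> float:
        """Contrastive alignment: reward similarity to positive anchors,
        penalize similarity to negative constraints."""
        c = self._normalize(candidate)
        pos = max((float(c @ p) for p in self.positives), default=0.0)
        neg = max((float(c @ n) for n in self.negatives), default=0.0)
        return pos - self.penalty * neg

# Re-rank detector proposals after one round of user feedback.
state = IntentState()
state.confirm(np.random.randn(512))  # embedding of a confirmed cue
state.reject(np.random.randn(512))   # embedding of a rejected hypothesis
proposals = [np.random.randn(512) for _ in range(5)]
ranked = sorted(proposals, key=state.score, reverse=True)
```

Because the memories are just embedding sets and scoring is a handful of dot products per candidate, each feedback round adds only a re-ranking pass over existing proposals, which is consistent with the sub-30 ms per-interaction latency reported above.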