ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence • Machine Learning

AI summary

The authors present ActiveSAM, a training-free method that makes the Segment Anything Model 3 (SAM 3) more efficient at segmenting images over a large vocabulary of object classes. Instead of decoding every class in the vocabulary at full resolution, ActiveSAM first runs a cheap low-resolution preview to estimate which classes are present in the image, then fully decodes only those classes. This saves time while improving accuracy over the previous state of the art, SegEarth-OV3, especially on large-vocabulary datasets and on corrupted or noisy images. It requires no training on the target dataset, no weight updates, and no ground-truth information about which classes appear in each image.

Segment Anything Model (SAM) • open-vocabulary semantic segmentation • zero-shot inference • image segmentation • prompt engineering • active set selection • background calibration • image corruption robustness • autonomous driving • embodied AI

Authors

Tran Dinh Tien, Zhiqiang Shen

Abstract

Segment Anything Model 3 (SAM 3) provides a strong frozen backbone for concept-prompted segmentation, but applying it directly to open-vocabulary semantic segmentation (OVSS) is inefficient: full-resolution decoding is typically run over the entire dataset vocabulary, whereas each image contains only a small active subset of classes. We introduce ActiveSAM, a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter. ActiveSAM first canonicalizes and expands class prompts, then estimates an image-conditioned active set from a low-resolution presence preview. Only the retained classes are decoded at full resolution, using bucketed prompt multiplexing with the frozen SAM 3 decoder. The preview stage uses only class-presence evidence and skips unnecessary segmentation-head computation, while the final stage applies margin-aware background calibration to suppress low-confidence pixels. ActiveSAM requires no target-dataset training, no weight updates, and no oracle class-presence labels. Across eight OVSS benchmarks, ActiveSAM improves the speed-accuracy tradeoff of training-free open-vocabulary semantic segmentation, outperforming the current state-of-the-art SegEarth-OV3 by approximately +1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. ActiveSAM also demonstrates the strongest robustness under image corruption that simulates real-world distribution shift, making it well-suited for deployment in noisy-input domains such as autonomous driving and embodied AI. Code is available at https://github.com/VILA-Lab/ActiveSAM.
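
The two selection steps named in the abstract, choosing an image-conditioned active set from presence scores and sending low-confidence pixels to background, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the fixed presence threshold, the confidence floor, and the top-1 minus top-2 margin rule are all assumptions made for illustration, and the SAM 3 preview and decoder are replaced by plain score arrays.

```python
import numpy as np


def select_active_set(presence_scores, threshold=0.5, min_keep=1):
    """Pick the classes to decode at full resolution (hypothetical rule).

    presence_scores: one preview score per vocabulary class.
    Keeps classes scoring at least `threshold`; if fewer than `min_keep`
    pass, falls back to the `min_keep` highest-scoring classes.
    """
    scores = np.asarray(presence_scores, dtype=float)
    keep = np.flatnonzero(scores >= threshold)
    if keep.size < min_keep:
        keep = np.argsort(-scores)[:min_keep]
    return np.sort(keep)


def segment_with_background(class_scores, active, conf_floor=0.5,
                            margin=0.1, background=-1):
    """Label each pixel with its best active class or with background.

    class_scores: array of shape (K, H, W), per-pixel scores for the K
    active classes, in the same order as `active`.
    A pixel becomes background when its best score is below `conf_floor`
    or when the gap between the best and second-best score is below
    `margin` (an assumed stand-in for margin-aware calibration).
    """
    scores = np.asarray(class_scores, dtype=float)
    active = np.asarray(active)
    # Rank the active classes per pixel, best first.
    order = np.argsort(-scores, axis=0)
    top = np.take_along_axis(scores, order[:1], axis=0)[0]
    if scores.shape[0] > 1:
        second = np.take_along_axis(scores, order[1:2], axis=0)[0]
    else:
        # With a single active class there is no runner-up.
        second = np.zeros_like(top)
    # Map positions within the active set back to vocabulary indices.
    labels = active[order[0]]
    uncertain = (top < conf_floor) | (top - second < margin)
    return np.where(uncertain, background, labels)
```

With presence scores `[0.9, 0.1, 0.7, 0.05]` and the default threshold, only classes 0 and 2 would be passed on to full-resolution decoding, so the expensive stage scales with the number of classes actually present rather than with the vocabulary size.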
