Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

2026-02-13 • Computer Vision and Pattern Recognition

AI summary

The authors study image segmentation driven by conversational queries, including questions about how objects are used or whether they are safe, not only where they are. They define a new task, Conversational Image Segmentation, and build a benchmark named ConverSeg that covers concepts such as spatial relations and physical reasoning. They also develop a model, ConverSeg-Net, that combines language understanding with segmentation priors, along with a data engine that generates training data automatically. Their results show that existing models struggle with this task, while ConverSeg-Net achieves significant gains on the new benchmark and maintains strong performance on existing language-guided segmentation benchmarks.

conversational image segmentation, referring image grounding, spatial relations, physical reasoning, affordances, image segmentation, language-guided segmentation, benchmark dataset, prompt engineering, AI data generation

Authors

Aadarsh Sahoo, Georgia Gkioxari

Abstract

Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/
