AI summary
The authors introduce GaussDet, a new method to improve 3D scene understanding by breaking down scenes into meaningful objects without relying heavily on complex language-image features. Instead, they use 2D object detectors that can understand more detailed descriptions and combine information from multiple views to label objects more accurately. This approach helps reduce mistakes from grouping errors and supports better recognition of objects based on complex language queries. Their evaluations show GaussDet performs better than previous methods, especially in identifying objects referred to by detailed expressions without prior training. Overall, the authors provide a way to more reliably connect language instructions to 3D scenes.
3D Gaussian Splatting, Open-vocabulary segmentation, Referring expression grounding, Contrastive Language-Image Pretraining (CLIP), Instance grouping, 2D object detection, Zero-shot learning, Multi-view aggregation, Semantic labeling, Embodied AI
Authors
Jameel Hassan, Yasiru Ranasinghe, Vishal Patel
Abstract
3D Gaussian Splatting (3DGS) has emerged at the forefront of 3D scene reconstruction. Extending 3DGS with language-driven, open-vocabulary understanding has gained significant attention for real-world applications such as embodied AI. Recent methods achieve this by learning an instance feature attribute and assigning semantics by distilling high-dimensional Contrastive Language-Image Pretraining (CLIP) features directly into the scene representation. However, the instance grouping mechanisms of these methods either require a predefined number of instances or suffer from noise in their bottom-up grouping strategies. Furthermore, the reliance on CLIP restricts semantic understanding to simple noun phrases, preventing complex spatial reasoning and referential expression grounding. We present GaussDet, a method that circumvents the need for dense CLIP features by leveraging discrete, open-vocabulary 2D object detectors with referring expression capabilities. We learn instance features for individual Gaussians to decompose the scene into 3D instance groups. By rendering these groups and aggregating semantic votes from multi-view 2D detections, we generate a robust View-Aggregated Semantic Label Distribution (VASD) for each 3D instance. This view-aggregation strategy acts as a strong regularizer, attenuating spurious labels caused by low-quality instance grouping. Our approach enables a straightforward, zero-shot extension from simple language queries to complex referential grounding. Extensive evaluations across two key tasks -- open-vocabulary segmentation (LeRF-OVS, ScanNet) and referring expression grounding (Ref-LeRF) -- demonstrate that GaussDet achieves consistent improvements over existing methods. Most notably, we achieve a substantial 16.7% mIoU improvement in referential grounding within a strict zero-shot setting.
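The view-aggregation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `aggregate_view_votes`, the `(instance_id, label_id, score)` vote format, and the choice to sum detector scores and normalize them into a distribution are all assumptions made for illustration. The abstract only states that semantic votes from multi-view 2D detections are aggregated per 3D instance.

```python
from collections import defaultdict


def aggregate_view_votes(view_detections, num_labels):
    """Build a per-instance label distribution from multi-view votes.

    view_detections: one list per rendered view, each holding hypothetical
    (instance_id, label_id, score) tuples, i.e. a 2D detection with label
    `label_id` matched to the rendered 3D instance group `instance_id`.
    Returns {instance_id: [p_0, ..., p_{num_labels-1}]}.
    """
    totals = defaultdict(lambda: [0.0] * num_labels)
    for detections in view_detections:
        for instance_id, label_id, score in detections:
            # Assumption: votes are weighted by detector confidence.
            totals[instance_id][label_id] += score

    distributions = {}
    for instance_id, scores in totals.items():
        total = sum(scores)
        if total > 0:
            # Normalize so each instance's scores sum to 1.
            distributions[instance_id] = [s / total for s in scores]
    return distributions


# Three views vote on instance 0; one view disagrees with the other two.
views = [
    [(0, 1, 1.0), (1, 0, 0.5)],
    [(0, 1, 1.0)],
    [(0, 2, 1.0)],
]
vasd = aggregate_view_votes(views, num_labels=3)
best_label = max(range(3), key=lambda k: vasd[0][k])
```

In this toy case, the single disagreeing view for instance 0 is outvoted by the two consistent views, which is the kind of attenuation of spurious labels that the abstract attributes to aggregating over views.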