UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

2026-03-24
Computer Vision and Pattern Recognition

AI summary

The authors propose UniFunc3D, a method for identifying fine-grained interactive parts in 3D scenes from natural-language instructions. Unlike previous methods that select frames passively and often miss important details, UniFunc3D actively selects video frames and reasons about the scene in a single pass, without task-specific training. This lets it attend to both broad scene context and fine details. On the SceneFun3D benchmark, the method achieves a large relative improvement in mIoU over prior training-free and training-based approaches.

functionality segmentation, 3D scenes, multimodal large language model, spatial-temporal grounding, fine-grained masks, task decomposition, frame selection, SceneFun3D, mIoU, training-free methods
Authors
Jiaying Lin, Dan Xu
Abstract
Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
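The coarse-to-fine frame selection described in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy version, not the authors' implementation: the frame ids, the `relevance` scoring function, and the `k_coarse`/`k_fine` parameters are all stand-ins for whatever the MLLM-based observer actually computes.

```python
# Hypothetical sketch of coarse-to-fine frame selection: rank frames by a
# relevance score, keep a coarse set as global context, then narrow to a
# fine set for high-detail reasoning about interactive parts.

def coarse_to_fine_select(frames, relevance, k_coarse=4, k_fine=2):
    """Return (coarse, fine) frame subsets.

    coarse -- top-k_coarse frames, preserving global context for disambiguation
    fine   -- top-k_fine of those, used for fine-grained, high-detail parsing
    """
    ranked = sorted(frames, key=relevance, reverse=True)
    coarse = ranked[:k_coarse]
    fine = coarse[:k_fine]
    return coarse, fine

# Toy example: frames are ids; relevance is a made-up score table.
scores = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.7, 4: 0.2, 5: 0.8}
coarse, fine = coarse_to_fine_select(list(scores), scores.get)
print(coarse, fine)  # -> [1, 5, 3, 2] [1, 5]
```

In the paper's setting the ranking would come from the MLLM's own joint reasoning in a single forward pass rather than a precomputed score table; the sketch only shows the selection structure.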