PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding
2026-06-04 • Computer Vision and Pattern Recognition
AI summary
The authors created PAR3D, a new system that helps computers better understand 3D scenes by recognizing not just whole objects but also their individual parts. They made a special dataset called ScenePart, which includes 3D scenes labeled with detailed part information and language instructions to help train their model. Their approach improves the computer's ability to answer questions about and locate specific parts within objects, while still performing well on tasks involving whole objects. This helps machines interact more precisely with 3D environments.
3D multimodal large language models, 3D scene understanding, part-level segmentation, visual question answering, referring segmentation, 3D representation learning, hierarchical segmentation, synthetic dataset, embodied interaction, language instructions
Authors
Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao
Abstract
Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.