MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
2026-03-19 • Computer Vision and Pattern Recognition
AI summary
The authors developed MonoArt, a method that can figure out the shape and movement of objects with parts (like doors or arms) from just one picture. Instead of guessing movement directly, their approach breaks down the problem step-by-step, turning the image into a basic 3D shape, identifying parts, and then understanding how those parts might move. This makes the predictions more stable and easier to understand without needing extra videos or complicated steps. Tests showed their method works well and is fast, with potential uses in robotics and understanding scenes with moving parts.
articulated objects, 3D reconstruction, geometry, motion parameters, canonical geometry, part structure, progressive structural reasoning, PartNet-Mobility, robotic manipulation, scene reconstruction
Authors
Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu
Abstract
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
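To make the progressive reasoning described above concrete, the sketch below shows one plausible way such a pipeline could be structured: image features are first mapped to canonical geometry, then to per-part representations, and finally to motion parameters. All module names, dimensions, and output parameterizations are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a progressive pipeline:
# image features -> canonical geometry -> part tokens -> motion parameters.
# Shapes and heads are assumptions for illustration only.
import torch
import torch.nn as nn

class ProgressiveArticulationNet(nn.Module):
    def __init__(self, feat_dim=256, num_parts=8, num_points=1024):
        super().__init__()
        self.feat_dim, self.num_parts, self.num_points = feat_dim, num_parts, num_points
        # Stage 1: encode the single input image into a global feature vector.
        self.image_encoder = nn.Sequential(
            nn.LazyConv2d(feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Stage 2: decode canonical geometry (a coarse point cloud in rest pose).
        self.geometry_head = nn.Linear(feat_dim, num_points * 3)
        # Stage 3: produce structured per-part embeddings.
        self.part_head = nn.Linear(feat_dim, num_parts * feat_dim)
        # Stage 4: per-part motion parameters (axis, pivot, joint state), assumed layout.
        self.motion_head = nn.Linear(feat_dim, 7)

    def forward(self, image):
        feat = self.image_encoder(image)                                   # (B, feat_dim)
        geometry = self.geometry_head(feat).view(-1, self.num_points, 3)   # canonical geometry
        parts = self.part_head(feat).view(-1, self.num_parts, self.feat_dim)
        motion = self.motion_head(parts)                                   # (B, num_parts, 7)
        return geometry, parts, motion

# Usage: a single RGB image yields geometry, part embeddings, and motion parameters.
net = ProgressiveArticulationNet()
geo, parts, motion = net(torch.randn(1, 3, 224, 224))
print(geo.shape, parts.shape, motion.shape)  # (1, 1024, 3) (1, 8, 256) (1, 8, 7)
```

The point of the sketch is the factorization: articulation is not regressed directly from raw image features but from intermediate geometric and part-level representations, which is the stabilizing idea the abstract emphasizes.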