PAct: Part-Decomposed Single-View Articulated Object Generation
2026-02-16 • Computer Vision and Pattern Recognition • Robotics
AI summary
The authors present a method to create 3D models of objects made of moving parts, like drawers or doors, from just a single image. Their approach focuses on understanding and generating each movable part separately, including how the parts fit together and move. Unlike slower methods that require lengthy per-instance optimization or rely on pre-made templates, their method quickly produces accurate and controllable 3D objects in a single feed-forward pass. Their experiments show it works better and faster than previous approaches while keeping the shape and parts consistent with the input image.
articulated objects, 3D reconstruction, part decomposition, kinematic rigging, latent tokens, single-image generation, feed-forward inference, embodied AI, motion articulation, 3D object synthesis
Authors
Qingming Liu, Xinyue Yao, Shuyuan Zhang, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia
Abstract
Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.
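The abstract describes a part-centric representation in which each movable part is encoded by latent tokens augmented with part identity and articulation cues, and the set of parts conditions a single-image generator. The sketch below illustrates one plausible way such per-part conditioning could be structured; all names, fields, and the joint parameterization are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PartToken:
    """Hypothetical per-part record: latent geometry tokens plus
    part identity and articulation cues (illustrative only)."""
    part_id: int               # identity index for this movable part
    latent_tokens: np.ndarray  # (num_tokens, dim) geometry latents for the part
    joint_type: str            # assumed vocabulary, e.g. "revolute" or "prismatic"
    joint_axis: np.ndarray     # (3,) articulation axis in object coordinates
    joint_origin: np.ndarray   # (3,) pivot or slide origin
    motion_range: tuple        # (min, max) angle in radians or translation in meters

def assemble_condition(parts: list) -> np.ndarray:
    """Concatenate part latents with identity and articulation cues into one
    conditioning sequence, a simple stand-in for part-aware conditioning."""
    rows = []
    for p in parts:
        cue = np.concatenate([[p.part_id], p.joint_axis, p.joint_origin, p.motion_range])
        # broadcast the articulation cue onto every latent token of this part
        rows.append(np.hstack([p.latent_tokens,
                               np.tile(cue, (p.latent_tokens.shape[0], 1))]))
    return np.vstack(rows)

# Example: a cabinet with one prismatic drawer and one revolute door
drawer = PartToken(0, np.random.randn(8, 16), "prismatic",
                   np.array([0.0, 0.0, 1.0]), np.zeros(3), (0.0, 0.4))
door = PartToken(1, np.random.randn(8, 16), "revolute",
                 np.array([0.0, 1.0, 0.0]), np.array([0.3, 0.0, 0.0]), (0.0, 1.57))
cond = assemble_condition([drawer, door])
print(cond.shape)  # (16, 25): 16 latent dims + 9 cue dims per token
```

In an actual feed-forward model the cues would more likely be embedded and attended to rather than concatenated, but the sketch conveys the idea of tying geometry latents to explicit part identity and motion parameters.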