UnfoldArt: Zero-Shot Recovery of Full Articulated 3D Objects from Text or Image

2026-06-29 · Computer Vision and Pattern Recognition

AI summary

The authors developed a method for reconstructing 3D objects with movable parts, of the kind needed in robotics, embodied AI, and virtual reality, using only text or an image as input. Their technique uses a group of AI agents that debate the object's parts, how those parts move, and what hidden shapes are revealed when they move. The agents ground their reasoning in generated video, and the same video model is then used to reveal interior geometry that cannot be seen in a single picture. The result is a detailed 3D model that includes motion and hidden structure, addressing the weaknesses of earlier methods that were limited by scarce supervised data or by missing priors.

articulated 3D objects, embodied AI, articulation parameters, agentic reasoning, video generative prior, hidden geometry, motion-consistent reconstruction, vision-language models, occluded interiors, interactive environments
Authors
Mohamed El Amine Boudjoghra, Ivan Laptev, Angela Dai
Abstract
Articulated 3D objects are essential for interactive environments in embodied AI, robotics, and virtual reality, but reconstructing their structure and motion from sparse observations remains challenging. Existing approaches are largely constrained by a lack of supervised data, or lack the priors needed to reliably recover articulation, hidden geometry, and internal object structure. We present the first debate-driven agentic approach to articulated 3D object reconstruction from text or image inputs that both grounds articulation reasoning in concrete motion and exposes the occluded geometry revealed under articulation. High-level agents reason about object semantics and motion using knowledge from vision-language and video models, while low-level agents estimate articulation parameters and interaction points. Together, they engage in a two-round structured debate that first exploits global-local disagreement and then grounds the agents in freely generated video. The same video prior, conditioned on the agreed articulation, then drives each part through its motion to expose occluded interiors and geometry that cannot be inferred from a single static view. By combining agentic reasoning with a video generative prior, our approach jointly infers articulation and reconstructs complete 3D articulated objects, producing high-fidelity geometry, internal structure, and motion-consistent states beyond directly observed surfaces.
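
The abstract does not say how articulation parameters are represented. As an illustration only, the sketch below assumes the common convention of a joint type (revolute or prismatic), an axis, a pivot point, and a motion range, and shows what "driving a part through its motion" means geometrically for a single point on that part. The names `Joint`, `move_point`, and `sweep` are hypothetical and are not taken from the paper; the rotation uses the standard Rodrigues formula.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Joint:
    """Hypothetical articulation parameters for one movable part."""
    kind: str     # "revolute" (hinge) or "prismatic" (slider)
    axis: Vec3    # joint axis direction; need not be unit length
    pivot: Vec3   # a point on the axis; unused for prismatic joints
    lower: float  # radians (revolute) or scene units (prismatic)
    upper: float


def _unit(v: Vec3) -> Vec3:
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("joint axis must be non-zero")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def move_point(joint: Joint, p: Vec3, q: float) -> Vec3:
    """Position of point p on the part at joint state q (clamped to range)."""
    q = min(max(q, joint.lower), joint.upper)
    k = _unit(joint.axis)
    if joint.kind == "prismatic":
        # Slide along the axis by q.
        return (p[0] + q * k[0], p[1] + q * k[1], p[2] + q * k[2])
    if joint.kind != "revolute":
        raise ValueError(f"unknown joint kind: {joint.kind!r}")
    # Rodrigues: v' = v cos q + (k x v) sin q + k (k . v)(1 - cos q),
    # applied to the offset from the pivot.
    v = (p[0] - joint.pivot[0], p[1] - joint.pivot[1], p[2] - joint.pivot[2])
    kxv = (k[1] * v[2] - k[2] * v[1],
           k[2] * v[0] - k[0] * v[2],
           k[0] * v[1] - k[1] * v[0])
    kdv = k[0] * v[0] + k[1] * v[1] + k[2] * v[2]
    c, s = math.cos(q), math.sin(q)
    rotated = tuple(v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c)
                    for i in range(3))
    return (joint.pivot[0] + rotated[0],
            joint.pivot[1] + rotated[1],
            joint.pivot[2] + rotated[2])


def sweep(joint: Joint, p: Vec3, steps: int) -> List[Vec3]:
    """Evenly spaced positions of p from the lower to the upper joint limit."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    span = joint.upper - joint.lower
    return [move_point(joint, p, joint.lower + span * i / (steps - 1))
            for i in range(steps)]
```

For example, a cabinet door could be modeled as `Joint("revolute", (0, 0, 1), (0, 0, 0), 0.0, math.pi / 2)`, and `sweep` would then give the positions of a point on the door as it opens. In this simplified picture, each intermediate state is a pose at which previously occluded interior surfaces become visible.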