Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization
2026-02-23 • Robotics
Robotics • Computer Vision and Pattern Recognition
AI summary
The authors developed a method for building accurate computer models of cluttered real-world scenes containing multiple objects. Their approach estimates the shapes and positions of these objects together while accounting for how they physically interact, such as touching or resting on each other. It relies on a contact model that is differentiable with respect to both shape and pose, so all objects can be optimized jointly, and on an efficient solver that keeps computation manageable as more objects are added. The full system combines learning-based initial guesses with physics-constrained refinement to produce scenes that are ready for simulation. Experiments cover cluttered scenes with up to 5 objects and 22 convex hulls.
real-to-sim scene estimation • rigid objects • shape optimization • pose estimation • contact model • augmented Lagrangian • differentiable physics • convex hull • linear system solver • simulation-ready
Authors
Wei-Cheng Huang, Jiaheng Han, Xiaohan Ye, Zherong Pan, Kris Hauser
Abstract
Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. However, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage a recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end real-to-sim scene estimation pipeline that integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses.
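To make the idea of physics-constrained estimation concrete, the following is a minimal sketch of a generic augmented Lagrangian loop on a toy problem. It is not the authors' formulation: the abstract does not give their objective, contact model, or solver, so everything here is an assumption chosen for illustration. Two spheres on a line have noisy observed centers `a` and `b` that overlap; the sketch finds the closest centers `x1`, `x2` that satisfy the non-penetration constraint `x2 - x1 >= r1 + r2`. The function name `settle_two_spheres`, the penalty weight `rho`, and the use of plain gradient descent for the inner solve are all hypothetical choices.

```python
def settle_two_spheres(a, b, r1, r2, rho=10.0, outer_iters=20, inner_iters=200):
    """Toy augmented Lagrangian solve (illustrative assumption, not the paper's method).

    Minimizes 0.5*((x1 - a)^2 + (x2 - b)^2)
    subject to g(x) = (x2 - x1) - (r1 + r2) >= 0  (no interpenetration).
    """
    x1, x2 = a, b          # start from the observed (possibly overlapping) centers
    lam = 0.0              # multiplier estimate for the contact constraint
    # Safe step size: the largest Hessian eigenvalue of the inner objective is 1 + 2*rho.
    step = 1.0 / (1.0 + 2.0 * rho)

    for _ in range(outer_iters):
        # Inner loop: gradient descent on the augmented Lagrangian in (x1, x2).
        for _ in range(inner_iters):
            g = (x2 - x1) - (r1 + r2)
            m = max(0.0, lam - rho * g)   # effective contact force; zero when separated
            grad_x1 = (x1 - a) + m        # dg/dx1 = -1, so the force pushes x1 left
            grad_x2 = (x2 - b) - m        # dg/dx2 = +1, so the force pushes x2 right
            x1 -= step * grad_x1
            x2 -= step * grad_x2
        # Outer step: standard multiplier update for an inequality constraint.
        g = (x2 - x1) - (r1 + r2)
        lam = max(0.0, lam - rho * g)

    return x1, x2, lam


if __name__ == "__main__":
    # Unit spheres observed 1.0 apart, so they overlap by 1.0.
    x1, x2, lam = settle_two_spheres(a=0.0, b=1.0, r1=1.0, r2=1.0)
    print(x1, x2, lam)
```

For this input the centers should move apart symmetrically toward roughly -0.5 and 1.5, with the multiplier approaching 0.5, which plays the role of a contact force. In a scene with many objects the inner solve would typically use Newton-type steps, and the Hessian would couple only objects that are in contact, which is the kind of sparsity the abstract refers to.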