VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

2026-06-29 • Robotics

Robotics • Artificial Intelligence • Graphics

AI summary

The authors created a way to train a humanoid robot to move and interact with objects by using a large set of synthetic data. They built 3D models of indoor spaces and generated example movements and camera views for the robot to learn from, without needing humans to label anything. Their system trains a model to predict whole-body motions from visual inputs and commands, which can then be executed on a real robot. Testing showed that this approach helped the robot perform navigation and object transport tasks in the real world.

Humanoid robot • Egocentric vision • Loco-manipulation • 3D reconstruction • Gaussian Splatting • Sim-to-real transfer • Kinematic trajectories • Vision-language models • Indoor navigation • Robot learning

Authors

Yen-Jen Wang, Jiaman Li, Sirui Chen, Takara E. Truong, Pei Xu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, Karen Liu

Abstract

Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale. We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories using privileged scene information, and renders paired egocentric observations after the fact. We produce 48,000 paired trajectories with no human intervention and train a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker converts these predictions into actions on the physical humanoid. We evaluate on the physical Unitree G1 performing navigation and single-object transport, demonstrating that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation. Project Website: https://vision-language-kinematics.github.io/
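As a rough illustration of the synchronized tuple the abstract describes (egocentric image, language command, short-horizon kinematic target), the following minimal Python sketch pairs each rendered frame of a trajectory with the poses that follow it. It is not the authors' implementation: the names `VLKSample` and `make_samples`, the pose layout, the nested-list image placeholder, and the horizon length are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VLKSample:
    """One hypothetical training sample (field names are assumptions)."""
    image: Sequence[Sequence[float]]  # placeholder for an egocentric frame
    command: str                      # language instruction for the task
    future_poses: List[List[float]]   # short-horizon kinematic targets


def make_samples(frames, command, poses, horizon):
    """Pair each frame with the next `horizon` poses of the same trajectory."""
    if len(frames) != len(poses):
        raise ValueError("frames and poses must be synchronized (same length)")
    samples = []
    # Stop early so that every sample has a full horizon of future poses.
    for t in range(len(poses) - horizon):
        target = poses[t + 1 : t + 1 + horizon]
        samples.append(VLKSample(frames[t], command, target))
    return samples


# Toy trajectory: 5 timesteps, 3-number poses, 1x1 "images" (all made up).
poses = [[float(t), 0.0, 0.0] for t in range(5)]
frames = [[[0.1 * t]] for t in range(5)]
samples = make_samples(frames, "carry the box to the table", poses, horizon=2)
```

In this sketch the trajectory exists before the frames do, which mirrors the abstract's ordering: motion is synthesized first and the paired observations are rendered afterwards.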
