ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

2026-03-03

Robotics, Computer Vision and Pattern Recognition
AI summary

The authors created ULTRA, a system that helps humanoid robots move and interact with their environment more flexibly and safely. They developed a way to convert human motion data into robot movements that respect physical rules, especially during contact with objects. Then, they trained a controller that can follow detailed motions or simple high-level goals using different types of sensors, even noisy visual inputs. Their approach allows robots to act autonomously and adapt to new tasks without needing exact motion instructions. Tests in simulation and on a real robot showed ULTRA performs better than older methods that only follow preset motions.
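The paper's retargeting is a learned, physics-driven pipeline; as a loose illustration of the simplest ingredient ("robot movements that respect physical rules"), the sketch below clamps retargeted joint angles to the robot's joint limits so each frame stays physically realizable. The joint names and limit values are invented for illustration; a real system would read them from the robot's URDF.

```python
import numpy as np

# Hypothetical joint-limit table (radians) for a few illustrative joints;
# these values are made up, not taken from the Unitree G1.
LIMITS = {"knee": (-0.1, 2.6), "hip_pitch": (-2.0, 2.0), "ankle": (-0.9, 0.5)}

def retarget_frame(human_angles, scale=1.0):
    """Map one mocap frame of joint angles onto the robot, enforcing
    joint limits so the retargeted pose stays physically realizable."""
    robot = {}
    for joint, q in human_angles.items():
        lo, hi = LIMITS[joint]
        robot[joint] = float(np.clip(scale * q, lo, hi))
    return robot

# A human knee flexion of 2.9 rad exceeds the robot's range and is clamped.
frame = {"knee": 2.9, "hip_pitch": -0.4, "ankle": 0.1}
print(retarget_frame(frame))  # knee clamped to its upper limit 2.6
```

The actual method additionally preserves contact relationships with objects, which simple per-joint clipping cannot capture.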

Keywords
humanoid robots, neural retargeting, motion capture, whole-body loco-manipulation, multimodal controller, reinforcement learning, egocentric perception, physics-based simulation, goal-conditioned control
Authors
Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, Liang-Yan Gui
Abstract
Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
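The distillation step described in the abstract (compressing a universal tracking policy's motor skills into a compact latent space consumed by the student controller) can be caricatured with a linear toy model. Everything below is a hedged sketch under invented dimensions: the teacher is stubbed as a fixed linear map, the "encoder" is a random projection into a low-dimensional latent code, and distillation reduces to least-squares regression of student actions onto teacher actions. None of this is the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
OBS_DIM, REF_DIM, LATENT_DIM, ACT_DIM = 8, 16, 4, 6

# "Teacher": the universal tracking policy, stubbed as a fixed linear
# map from (proprioceptive obs, dense motion reference) to joint targets.
W_teacher = rng.normal(size=(OBS_DIM + REF_DIM, ACT_DIM)) * 0.1

def teacher_action(obs, ref):
    return np.concatenate([obs, ref]) @ W_teacher

# Student side: the dense reference is compressed into a compact latent
# skill code z; the controller then acts from (obs, z) only.
W_enc = rng.normal(size=(REF_DIM, LATENT_DIM)) / np.sqrt(REF_DIM)

def encode(ref):
    return ref @ W_enc

# Distillation: regress student actions onto teacher actions over a
# batch of sampled states, solved in closed form for this linear toy.
N = 4096
obs_batch = rng.normal(size=(N, OBS_DIM))
ref_batch = rng.normal(size=(N, REF_DIM))
X = np.hstack([obs_batch, encode(ref_batch)])                    # student inputs
Y = np.concatenate([obs_batch, ref_batch], axis=1) @ W_teacher   # teacher targets
W_dec, *_ = np.linalg.lstsq(X, Y, rcond=None)

def student_action(obs, ref):
    return np.concatenate([obs, encode(ref)]) @ W_dec

# The latent bottleneck discards information, so the match is only
# approximate; in ULTRA, RL finetuning further expands coverage.
obs, ref = rng.normal(size=OBS_DIM), rng.normal(size=REF_DIM)
gap = np.linalg.norm(student_action(obs, ref) - teacher_action(obs, ref))
print(f"teacher-student action gap after distillation: {gap:.3f}")
```

The point of the compression is that at test time the controller can be driven through the small latent code (or a sparse task specification mapped into it) rather than a full dense reference motion.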