LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation
2026-06-22 • Robotics
AI summary
The authors present LaST-HD, a method that helps robots learn to manipulate objects by relating human hand movements and robot actions in a shared latent space, rather than copying the exact hand motions. They train a model that links human and robot actions through how objects move and interact physically, which helps robots adapt to new tasks. To collect human movement data, they designed a low-cost motion-capture glove called the Out-of-Lab (OOL) Glove. Using this glove and their shared-space training, their approach improves robot learning efficiency and accuracy, even with limited new human data.
Human-hand demonstrations • Robot learning • Kinematic correspondence • Latent reasoning space • Action-conditioned world model • Cross-embodiment alignment • Motion-capture glove • Mixed co-training • Online correction
Authors
Jiaming Liu, Yinxi Wang, Chenyang Gu, Siyuan Qian, Xiangju Mi, Hao Chen, Jiawei Chen, Qingpo Wuwu, Xiaoqi Li, Nuowei Han, Yiming Zhang, Xuheng Zhang, Yang Yue, Yeqing Yang, Lei Wang, Peng Jia, Hao Tang, Shanghang Zhang
Abstract
Human-hand demonstrations provide a direct and scalable source of physical interaction data for robot learning. While manual retargeting is indispensable for establishing kinematic action correspondence across different morphologies, robust transfer requires going beyond geometry to align the underlying physical dynamics of human and robot manipulation. To address this, we introduce LaST-HD, a novel human-to-robot action learning paradigm that extends reasoning-before-acting vision-language-action (VLA) models by aligning human-hand and robot demonstrations in a shared latent reasoning space. Rather than mimicking human kinematics, LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets. After cross-embodiment representations are aligned in this shared forward-dynamics space, these targets supervise LaST-HD's latent reasoning process, enabling it to internalize shared physical dynamics and drive efficient human-hand action learning. Moreover, we develop the Out-of-Lab (OOL) Glove, a low-cost motion-capture glove tailored to LaST-HD for human-hand data collection. The captured human data provide precise keypoints and serve as universal action supervision across grippers and dexterous hands. Armed with the aligned latent space and high-fidelity human-hand data, we develop a progressive mixed-to-human training recipe comprising mixed human-robot co-training and human-hand online correction post-training. Through mixed co-training, LaST-HD improves generalization to novel objects, scenes, and positions using only human-hand demonstrations. With online correction, LaST-HD further adapts to novel environments and achieves over 90% accuracy using only 20 minutes of OOL Glove data.
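The shared forward-dynamics idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear model, the class and function names, the latent size, and the action dimensions (21 hand keypoints with 3 coordinates each for the human, a 7-dimensional command for the robot) are all assumptions made for illustration. It shows only the structural point that embodiment-specific actions are mapped into one latent space, where a single transition model produces next-state targets of identical shape for both embodiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only for illustration.
LATENT_DIM = 8
HUMAN_ACTION_DIM = 63  # assumed: 21 hand keypoints x 3 coordinates
ROBOT_ACTION_DIM = 7   # assumed: 6-DoF end-effector delta + gripper


class SharedForwardDynamics:
    """Toy linear action-conditioned world model (an assumption, not the paper's).

    Each embodiment has its own action embedding, but the latent
    transition z_next = z @ A + embed(action) @ B is shared.
    """

    def __init__(self):
        self.embed = {
            "human": rng.normal(0.0, 0.1, (HUMAN_ACTION_DIM, LATENT_DIM)),
            "robot": rng.normal(0.0, 0.1, (ROBOT_ACTION_DIM, LATENT_DIM)),
        }
        self.A = np.eye(LATENT_DIM)
        self.B = rng.normal(0.0, 0.1, (LATENT_DIM, LATENT_DIM))

    def predict(self, z, action, embodiment):
        # Project the embodiment-specific action into the shared latent space,
        # then apply the shared transition to get the next-latent target.
        u = action @ self.embed[embodiment]
        return z @ self.A + u @ self.B


def latent_target_loss(policy_latent, target_latent):
    """Mean squared error between a policy's latent and the world-model target."""
    return float(np.mean((policy_latent - target_latent) ** 2))


model = SharedForwardDynamics()
z = rng.normal(size=(4, LATENT_DIM))  # batch of current latents
human_actions = rng.normal(size=(4, HUMAN_ACTION_DIM))
robot_actions = rng.normal(size=(4, ROBOT_ACTION_DIM))

# Unpaired batches from either embodiment yield targets in the same space.
human_targets = model.predict(z, human_actions, "human")
robot_targets = model.predict(z, robot_actions, "robot")
```

In this sketch, `latent_target_loss` stands in for the supervision signal applied to the policy's latent reasoning output. A real system would learn the embeddings and transition from data rather than sample them at random.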