Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

2026-04-27 · Robotics

AI summary

The authors curated a large dataset from human videos of hands performing tasks and used it to teach robots to understand and imitate human hand movements. They designed MoT-HRA, a system that splits this learning into three parts: predicting the 3D path of movement, inferring human intentions from hand motion, and converting those intentions into robot actions. The approach helps robots reproduce human-like hand motions more faithfully and handle new situations more reliably. Tests showed it works well both in simulation and on real robots.

human demonstration datasets, robot manipulation, vision-language models, hand motion modeling, MANO hand model, 3D trajectory prediction, robot control, hierarchical learning, distribution shift, latent motion priors
Authors
Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun, Wenbo Ding
Abstract
Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.
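The abstract's "read-only key-value transfer" can be pictured as downstream action queries attending over upstream tokens that are treated as constants, so control can read the human priors without modifying them. The sketch below is purely illustrative: the shapes, token names, and plain NumPy attention are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: q (Tq, d), k and v (Tk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 32  # hidden size (illustrative)

# Upstream experts emit token sequences on the shared trunk (stand-ins here).
traj_tokens = rng.standard_normal((8, d))     # vision-language expert: 3D trajectory tokens
intent_tokens = rng.standard_normal((16, d))  # intention expert: latent hand-motion prior

# Read-only transfer: the fine expert's queries attend to upstream keys/values,
# which are held fixed, so no updates flow back into the upstream representations.
upstream_kv = np.concatenate([traj_tokens, intent_tokens], axis=0)
action_queries = rng.standard_normal((4, d))  # fine expert: a 4-step action chunk

action_chunk = attention(action_queries, upstream_kv, upstream_kv)
print(action_chunk.shape)  # → (4, 32)
```

In a trained model the "held fixed" part would correspond to stopping gradients on the upstream keys and values, which is one way to limit interference with the upstream representations as the abstract describes.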