Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos
2026-02-13 • Robotics
Robotics · Computer Vision and Pattern Recognition · Machine Learning
AI summary
The authors developed a way for robots to learn how to pick up objects and then move them by watching videos of humans. They note that while videos help with learning movements after gripping, they don’t teach robots the best way to grasp objects, especially since robot hands differ from human hands. Their solution, called Perceive-Simulate-Imitate (PSI), uses simulation to test which grasps work best for each task and teaches the robot accordingly. Their method lets robots learn complex skills without needing real robot practice data and improves the robot's success at manipulation tasks.
prehensile manipulation · grasp generation · modular policy · robot learning · simulation · human video motion data · task-oriented grasping · trajectory filtering · robot manipulation
Authors
Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi, Shenlong Wang, Wei-Chiu Ma
Abstract
The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.
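The core filtering step described in the abstract, pairing candidate grasps with post-grasp trajectories and labeling each pair by whether the rollout succeeds in simulation, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the function names, the grasp representation, and the stand-in "physics" (a simple angle-alignment check in place of a physics simulator and learned grasp generator) are all hypothetical.

```python
import random

def generate_grasps(num=8, seed=0):
    """Stand-in for a dedicated grasp generator: sample candidate grasps,
    each parameterized here by a single approach angle in radians."""
    rng = random.Random(seed)
    return [{"id": i, "approach_angle": rng.uniform(0.0, 3.14)}
            for i in range(num)]

def rollout_succeeds(grasp, trajectory):
    """Toy stand-in for a simulated rollout: the grasp survives the
    post-grasp motion only if its approach direction is roughly aligned
    with the trajectory's pull direction (threshold is arbitrary)."""
    return abs(grasp["approach_angle"] - trajectory["pull_angle"]) < 0.6

def filter_grasp_trajectory_pairs(grasps, trajectories):
    """Paired grasp-trajectory filtering: label every (grasp, trajectory)
    pair with a task-suitability flag, yielding supervised training data
    for task-oriented grasping."""
    labeled = []
    for traj in trajectories:
        for g in grasps:
            labeled.append({"grasp": g,
                            "trajectory": traj,
                            "suitable": rollout_succeeds(g, traj)})
    return labeled

# One human-video-derived post-grasp trajectory (hypothetical example).
trajectories = [{"name": "pull_drawer", "pull_angle": 1.5}]
dataset = filter_grasp_trajectory_pairs(generate_grasps(), trajectories)
positives = [d for d in dataset if d["suitable"]]
```

The key point the sketch captures is that stability alone is not the label: every candidate grasp is evaluated against the specific downstream motion, so only task-compatible grasps receive positive suitability labels for supervised learning.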