Do as I Do: Dexterous Manipulation Data from Everyday Human Videos
2026-06-17 • Robotics
Robotics, Computer Vision and Pattern Recognition
AI summary
The authors developed a method called DO AS I DO that converts ordinary monocular RGB videos of people using their hands into action sequences that multi-fingered robot hands can execute. It handles both egocentric (first-person) and exocentric (third-person) footage of hand-object interaction. In experiments on datasets with ground truth and on video clips collected online, it outperforms prior methods at estimating hand-object interactions and at extracting dexterous manipulation trajectories.
robotic manipulation, dexterous hands, hand-object interaction, monocular RGB videos, egocentric vision, exocentric vision, retargeting, robot learning, imitation learning, pose estimation
Authors
Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik
Abstract
How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms the previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truth and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.