Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping

2026-03-03 · Robotics

Robotics · Artificial Intelligence · Computer Vision and Pattern Recognition
AI summary

The authors present Tether, a method that lets robots learn by autonomously playing at tasks, with minimal human involvement. Starting from a handful of example demonstrations, the system warps those demonstrated actions to fit new scenes using semantic keypoint correspondences, which makes the policy both data-efficient and robust. A vision-language model then guides a continuous loop of task selection, execution, and evaluation, so the robot collects large amounts of high-quality data and keeps improving over time, eventually matching expert-level performance from only a small number of initial demonstrations.

robot learning, imitation learning, open-loop policy, semantic keypoints, vision-language models, autonomous play, task execution, data efficiency, robotic manipulation, closed-loop policies
Authors
William Liang, Sam Wang, Hung-Ju Wang, Osbert Bastani, Yecheng Jason Ma, Dinesh Jayaraman
Abstract
The ability to interact with the world and learn from that experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (≤10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.
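
To make the first mechanism concrete, below is a minimal sketch of correspondence-driven trajectory warping under simplifying assumptions: the semantic keypoints are already matched between the source demonstration and the target scene as 3D point arrays, and the warp is a single rigid (Kabsch) transform applied to every end-effector waypoint. The function names and the choice of a rigid transform are illustrative, not the paper's actual implementation.

```python
import numpy as np

def fit_rigid_transform(src_pts, tgt_pts):
    """Least-squares rigid transform (Kabsch algorithm) mapping source
    keypoints onto their corresponding target keypoints.
    Both inputs are (K, 3) arrays of matched 3D keypoints."""
    src_c, tgt_c = src_pts.mean(axis=0), tgt_pts.mean(axis=0)
    # Cross-covariance of the centered correspondences.
    H = (src_pts - src_c).T @ (tgt_pts - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: keep the solution a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t

def warp_trajectory(demo_waypoints, src_keypoints, tgt_keypoints):
    """Anchor a demonstrated end-effector trajectory (N, 3) to the target
    scene by applying the keypoint-aligned transform to every waypoint."""
    R, t = fit_rigid_transform(src_keypoints, tgt_keypoints)
    return demo_waypoints @ R.T + t
```

A recorded demonstration can then be replayed open-loop in a new scene via `warp_trajectory(demo_xyz, src_kp, tgt_kp)`; the key property is that only the keypoint correspondences, not the raw pixels, need to transfer between scenes.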
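
The autonomous play cycle can likewise be sketched as a simple loop. Everything here is a hypothetical interface (`scene_camera`, `skills`, `vlm`, and `dataset` are placeholders, not the paper's API); the sketch only illustrates the select-execute-evaluate-improve structure driven by a vision-language model.

```python
def autonomous_play(scene_camera, skills, vlm, dataset, n_episodes=100):
    """Hypothetical play loop: a VLM proposes a task for the current scene,
    the warped open-loop policy executes it, the VLM judges the outcome,
    and successful rollouts are kept as training data for imitation."""
    for _ in range(n_episodes):
        image = scene_camera.capture()
        task = vlm.select_task(image, candidates=list(skills))  # task selection
        trajectory = skills[task].execute(image)                # execution
        success = vlm.evaluate(scene_camera.capture(), task)    # evaluation
        if success:
            dataset.add(task, trajectory)                       # improvement
    return dataset
```

Because every step is automated, a loop of this shape can run for hours unattended, which is how a handful of seed demonstrations can grow into the 1000+ expert-level trajectories reported in the abstract.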