Point Tracking Improves World Action Models

2026-05-22

Robotics
AI summary

The authors propose JOPAT, a world-action model that jointly predicts latent visual observations, 2D point tracks with visibility, and robot actions, rather than predicting pixels alone. The point tracks give the model an explicit representation of motion over time that holds up when objects are occluded or move partly out of frame. On LIBERO and real-world LeRobot tasks, JOPAT improves over pixel-based baselines, with the largest gains on long-horizon tasks involving occlusion, object interaction, and off-screen motion.

robot policy learning, world-action model, latent visual observations, denoising diffusion transformer, 2D point tracks, occlusion, long-horizon dynamics, pixel-level prediction, representation learning, object interaction
Authors
Jiarui Guan, Wenshuai Zhao, Yue Pei, Ziliang Chen, Arno Solin, Juho Kannala
Abstract
Robot policy learning benefits from world-action models that capture environment dynamics, but pixel-level prediction entangles dynamics with nuisance factors such as lighting and texture, making learned representations vulnerable to task-irrelevant visual variation. We propose JOPAT, a JOint Pixel-And-Track World-Action Model that predicts latent visual observations, 2D point tracks with visibility, and actions in a single denoising diffusion transformer. The key insight is that tracks provide an explicit representation of motion that captures long-horizon dynamics and remains robust under occlusion or partial out-of-frame motion, offering greater utility than modeling pixel appearance alone. On LIBERO and real-world LeRobot tasks, JOPAT improves over pixel-based baselines, with the largest gains on long-horizon tasks involving occlusion, object interaction, and off-screen motion.
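
To make the idea of denoising three modalities together more concrete, here is a minimal NumPy sketch of how latent observations, point tracks with visibility, and actions could be packed into one vector and noised with a single shared noise level. It is an illustration only and not the authors' implementation: the function names (`pack`, `add_noise`, `unpack`), the coordinate normalization, the ±1 encoding of visibility, and all array shapes are assumptions, and the noising step is the standard DDPM forward process rather than anything specified in the abstract.

```python
import numpy as np

def pack(latents, tracks, visibility, actions, frame_size):
    """Flatten the three modalities into one vector (hypothetical layout)."""
    w, h = frame_size
    # Map pixel coordinates to roughly [-1, 1]. Off-screen points are kept:
    # they simply land outside that range instead of being dropped.
    xy = 2.0 * np.asarray(tracks, dtype=np.float64) / np.array([w, h]) - 1.0
    # Assumed encoding: visible -> +1.0, occluded -> -1.0.
    vis = np.where(visibility, 1.0, -1.0)[..., None]
    track_tokens = np.concatenate([xy, vis], axis=-1)  # shape (T, N, 3)

    parts = {
        "latents": np.asarray(latents, dtype=np.float64),
        "tracks": track_tokens,
        "actions": np.asarray(actions, dtype=np.float64),
    }
    flat, slices, start = [], {}, 0
    for name, arr in parts.items():
        slices[name] = (slice(start, start + arr.size), arr.shape)
        flat.append(arr.ravel())
        start += arr.size
    return np.concatenate(flat), slices

def add_noise(x0, alpha_bar, rng):
    """Standard DDPM forward step with one noise level for every modality."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def unpack(x, slices):
    """Split a joint vector back into per-modality arrays."""
    return {name: x[sl].reshape(shape) for name, (sl, shape) in slices.items()}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N = 4, 5                                      # timesteps, tracked points
    latents = rng.standard_normal((T, 8))            # placeholder latent frames
    tracks = rng.uniform(-20.0, 148.0, (T, N, 2))    # some points leave a 128x128 frame
    visibility = rng.random((T, N)) > 0.3            # True where a point is visible
    actions = rng.standard_normal((T, 7))            # placeholder action vectors

    x0, slices = pack(latents, tracks, visibility, actions, (128, 128))
    x_t, eps = add_noise(x0, alpha_bar=0.5, rng=rng)
    noisy = unpack(x_t, slices)
    print({name: arr.shape for name, arr in noisy.items()})
```

In a real model, a transformer would take the noised sequence and predict the noise (or the clean signal) for all three modalities at once, so a single network output covers appearance, motion, and control. Keeping off-screen points in the coordinate range instead of discarding them is one simple way to let tracks describe out-of-frame motion; the paper may handle this differently.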