PointAction: 3D Points as Universal Action Representations for Robot Control
2026-06-02 • Robotics
AI summary
The authors developed PointAction, a method that turns video predictions into robot actions by predicting dynamic 3D pointmaps alongside the usual RGB video frames. The pointmaps capture the metric 3D motion and geometry of task-relevant parts of the scene over time, which makes the predicted video easier to translate into executable robot actions. This structured 4D representation reduces the ambiguity of grounding actions in RGB-only video and supports transfer across tasks and robot embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.
Video-Action Models, Video Diffusion Models, 3D Point Clouds, 4D Modeling, Robot Manipulation, Action Grounding, Robot Embodiments, Diffusion-Based Action Decoder, Metric 3D Motion, Spatial Constraints
Authors
Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu
Abstract
Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.
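To make the idea of metric 3D point dynamics as an action interface more concrete, here is a minimal Python sketch. It is not the paper's method: PointAction uses a learned diffusion-based action decoder, whereas this sketch swaps in the classical Kabsch algorithm to recover a rigid motion from point displacements. The array layout `(T, N, 3)`, the assumption that points are already in correspondence across time, and the function names `point_displacements` and `rigid_motion` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def point_displacements(tracks):
    """Per-point metric 3D displacement between consecutive time steps.

    tracks: (T, N, 3) array of XYZ positions, assumed (hypothetically) to be
    in correspondence across time. Returns an array of shape (T-1, N, 3).
    """
    return tracks[1:] - tracks[:-1]


def rigid_motion(src, dst):
    """Kabsch algorithm: least-squares R, t such that dst ~ src @ R.T + t.

    This is a classical stand-in for illustration, not the paper's decoder.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Synthetic example: 50 points rotated 10 degrees about z and shifted (in meters).
rng = np.random.default_rng(0)
points0 = rng.normal(size=(50, 3))
angle = np.deg2rad(10.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.02, 0.0, 0.05])
tracks = np.stack([points0, points0 @ R_true.T + t_true])  # shape (2, 50, 3)

disp = point_displacements(tracks)         # shape (1, 50, 3)
R, t = rigid_motion(tracks[0], tracks[1])  # recovered rotation and translation
```

The point of the sketch is that a metric 3D motion can be read off directly from point positions, while the same motion is ambiguous in RGB pixels alone. A learned decoder, as described in the abstract, would replace `rigid_motion` and would also have to handle non-rigid motion, contact, and embodiment-specific action spaces.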