Learning Action Priors for Cross-embodiment Robot Manipulation
2026-06-24 • Robotics
Robotics • Artificial Intelligence • Computer Vision and Pattern Recognition
AI summary
The authors found that many models combining vision, language, and action struggle because the part controlling motion has to learn from scratch while also understanding visual and language cues. To fix this, they propose training the motion part separately first to learn how actions usually change over time, without looking at images or words. Then, they connect this trained motion knowledge to the full vision-language-action model, which helps it learn faster and perform better, especially when real-world training data is limited. Their approach works well across many different tasks and robot types.
Vision-Language-Action models, Vision-Language Model, action module, motion prior, cross-modal alignment, flow matching, encoder-decoder, latent distillation, cross-embodiment, policy training
Authors
Dong Jing, Tianqi Zhang, Jiaqi Liu, Jinman Zhao, Zelong Sun, Li Erran Li, Zhiwu Lu, Mingyu Ding
Abstract
Most Vision-Language-Action (VLA) models build on a Vision-Language Model (VLM) backbone by attaching an action module and optimizing the full policy jointly. This design inherits strong visual and linguistic priors from the VLM, but leaves the action module to learn physical motion almost from scratch. As a result, the policy lacks an explicit motion prior, forcing early optimization to simultaneously discover temporal action dynamics and cross-modal alignment, a challenge further amplified in cross-embodiment settings. In this work, we propose to pretrain the action module with motion priors before cross-modal VLA alignment. Specifically, we introduce a two-stage training framework that equips the action module with cross-embodiment temporal motion structure before VLA training begins. In Stage 1, a lightweight flow-matching-based encoder-decoder action module efficiently learns temporal motion structure solely from unconditioned action trajectories, without processing visual or language tokens. In Stage 2, this learned prior is transferred to VLA training through decoder reuse and early-stage latent distillation, aligning visual-language features with the action embedding space while still allowing end-to-end policy refinement. In addition, the trained encoder serves as a compact history compressor, summarizing state-action histories into a single temporal context token for history-aware modeling at negligible cost. Extensive experiments across 13 diverse cross-embodiment tasks on both simulated and real-world platforms validate the effectiveness of our approach. Compared with VLA training without action priors, our model achieves faster convergence, higher success rates, and substantially stronger performance on data-scarce real-world tasks. Moreover, scaling up the action data in Stage 1 yields a more generalizable action prior that directly improves downstream VLA performance.
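
To make the Stage 1 objective concrete, the following is a minimal sketch of a flow-matching loss computed on an action trajectory alone, with no visual or language input. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the probability path, so a straight-line path between Gaussian noise and the trajectory is assumed here, and the names `interpolate`, `target_velocity`, `flow_matching_loss`, and `predict_velocity` are hypothetical. The encoder-decoder structure and latent conditioning described in the abstract are omitted.

```python
import random

# Assumption: a straight-line path x_t = (1 - t) * noise + t * actions.
# The abstract names flow matching but does not state which path is used.

def interpolate(noise, actions, t):
    """Point on the straight line from noise (t=0) to the trajectory (t=1)."""
    return [[(1.0 - t) * n + t * a for n, a in zip(n_row, a_row)]
            for n_row, a_row in zip(noise, actions)]

def target_velocity(noise, actions):
    """Velocity of the straight-line path, constant in t: actions - noise."""
    return [[a - n for n, a in zip(n_row, a_row)]
            for n_row, a_row in zip(noise, actions)]

def mse(pred, target):
    """Mean squared error over every (timestep, action dimension) entry."""
    diffs = [(p - q) ** 2
             for p_row, q_row in zip(pred, target)
             for p, q in zip(p_row, q_row)]
    return sum(diffs) / len(diffs)

def flow_matching_loss(predict_velocity, actions, rng):
    """One-sample loss for a trajectory given as a list of action vectors.

    predict_velocity(x_t, t) is a hypothetical stand-in for the action
    module; it sees only the noised trajectory and the time t.
    """
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in actions]
    t = rng.random()  # sample a time in [0, 1)
    x_t = interpolate(noise, actions, t)
    return mse(predict_velocity(x_t, t), target_velocity(noise, actions))

# Usage: a 3-step trajectory of 2-D actions and a trivial zero predictor.
trajectory = [[0.1, -0.2], [0.3, 0.4], [0.5, 0.0]]
zero_model = lambda x_t, t: [[0.0 for _ in row] for row in x_t]
loss = flow_matching_loss(zero_model, trajectory, random.Random(0))
```

In a real system the predictor would be a trained network and the loss would be averaged over batches drawn from many embodiments; the point of the sketch is only that the training signal depends on action trajectories and nothing else.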