VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation
2026-03-27 • Robotics
AI summary
The authors address problems in teaching robots to perform tasks after initial training, where the usual methods either forget important skills or learn too slowly. They propose a new method called On-Policy VLA Distillation (VLA-OPD) that uses guidance from an expert teacher to correct mistakes during the robot's own practice, helping it learn more efficiently without losing previous abilities. Their approach uses a special mathematical formula (Reverse-KL) to keep learning stable and maintain a variety of actions. Tests show this method improves learning speed and reliability compared to traditional techniques.
Vision-Language-Action (VLA) models · Supervised Fine-Tuning (SFT) · Reinforcement Learning (RL) · Distribution shift · Catastrophic forgetting · On-Policy Distillation · Reverse-KL divergence · Sample efficiency · Policy learning · Robotic manipulation
Authors
Zhide Zhong, Haodong Yan, Junfeng Li, Junjie He, Tianran Zhang, Haoang Li
Abstract
Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shift and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework that bridges the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike the standard Forward-KL objective, which induces a mode-covering entropy explosion, or Hard-CE, which causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on the LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
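To make the Reverse-KL vs. Forward-KL distinction concrete, here is a minimal NumPy sketch of the two token-level distillation losses over a categorical action-token distribution. This is an illustrative reconstruction, not the paper's implementation: the logit values, vocabulary size, and function names are invented for the example. The mode-seeking behavior shows up directly: when the teacher is bimodal and the student has committed to one of the teacher's modes, the Reverse-KL penalty stays small, while the Forward-KL penalty is large because it forces the student to cover every teacher mode.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits, teacher_logits):
    """KL(pi_student || pi_teacher): mode-seeking.

    Expectation is taken under the *student's* own distribution,
    so probability mass the teacher spreads over modes the student
    ignores contributes almost nothing to the loss.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return float(np.sum(p_s * (np.log(p_s) - np.log(p_t)), axis=-1))

def forward_kl(student_logits, teacher_logits):
    """KL(pi_teacher || pi_student): mode-covering.

    Expectation is taken under the teacher, so the student is
    penalized heavily wherever the teacher has mass it lacks.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1))

# Hypothetical 3-token action vocabulary:
# teacher is bimodal (tokens 0 and 1), student commits to token 0.
teacher = np.array([2.0, 2.0, -5.0])
student = np.array([4.0, -4.0, -4.0])

rkl = reverse_kl(student, teacher)  # small: student sits on one teacher mode
fkl = forward_kl(student, teacher)  # large: student ignores the other mode
```

Under this toy setup the Reverse-KL loss is several times smaller than the Forward-KL loss, which is why a Reverse-KL objective lets the student stay decisive (no entropy explosion) while still being corrected wherever it places mass the teacher does not endorse.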