DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving
2026-03-11 • Computer Vision and Pattern Recognition; Robotics
AI summary
The authors introduce DynVLA, a driving model that forecasts how the world will change before deciding what to do. Their method, Dynamics CoT, compresses the future evolution of the scene into a small set of tokens, giving the model a compact picture of its environment. By separating the car's own motion from the motion of its surroundings, the model makes more accurate decisions. Experiments show DynVLA is both more accurate and faster than earlier methods that reason over plain text or over dense image predictions.
Keywords: Dynamics CoT, Dynamics Tokenizer, ego-centric dynamics, environment-centric dynamics, SFT (Supervised Fine-Tuning), RFT (Reinforcement Fine-Tuning), VLA (Vision-Language-Action) models, spatiotemporal understanding, action generation, compact world dynamics
Authors
Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan
Abstract
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
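The generation order the abstract describes — forecast compact, decoupled dynamics tokens first, then decide the action — can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the tokenizer here is a hand-written quantizer standing in for the learned Dynamics Tokenizer, and all names, shapes, and the decision rule are assumptions.

```python
# Illustrative sketch of the Dynamics-CoT ordering: emit compact ego-centric
# and environment-centric dynamics tokens before generating an action.
# All components below are toy stand-ins for the learned modules in DynVLA.
from dataclasses import dataclass


@dataclass
class DynamicsTokens:
    ego: list[int]  # tokens summarizing the ego vehicle's future motion
    env: list[int]  # tokens summarizing surrounding-agent / environment motion


def tokenize_dynamics(future_states: list[tuple[float, float]],
                      num_tokens: int = 4) -> list[int]:
    """Toy stand-in for the Dynamics Tokenizer: compress a future (x, y)
    trajectory into a small, fixed number of discrete tokens by subsampling
    waypoints and quantizing them into coarse codebook ids."""
    step = max(1, len(future_states) // num_tokens)
    tokens = []
    for x, y in future_states[::step][:num_tokens]:
        tokens.append(int(round(x)) * 100 + int(round(y)))  # toy codebook id
    return tokens


def plan(ego_future: list[tuple[float, float]],
         env_future: list[tuple[float, float]]) -> tuple[DynamicsTokens, str]:
    """Dynamics CoT: forecast compact world dynamics first, decoupled into
    ego-centric and environment-centric streams, then generate the action."""
    dyn = DynamicsTokens(ego=tokenize_dynamics(ego_future),
                         env=tokenize_dynamics(env_future))
    # Toy decision rule: brake if any environment token encodes a nearby agent.
    action = "brake" if any(t < 300 for t in dyn.env) else "keep_lane"
    return dyn, action
```

The point of the sketch is the interface, not the internals: the planner consumes a handful of dynamics tokens rather than a textual rationale or dense predicted frames, which is what keeps inference latency low in the Dynamics-CoT framing.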