NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

2026-02-24

Artificial Intelligence · Computer Vision and Pattern Recognition
AI summary

The authors work on Vision-Language-Action (VLA) models for self-driving cars, which usually need large datasets and detailed reasoning annotations to perform well. They propose No Reasoning for Driving (NoRD), a model that matches existing VLAs' performance while being fine-tuned on less than 60% of the data and no reasoning labels. They find that a common reinforcement-learning method, Group Relative Policy Optimization (GRPO), struggles on such small, reasoning-free datasets because of difficulty bias, which they fix by adopting a newer variant called Dr. GRPO. This lets their model learn effectively with less data and fewer annotations, making autonomous driving systems more efficient.

Vision-Language-Action (VLA) models · autonomous driving · Group Relative Policy Optimization (GRPO) · difficulty bias · Dr. GRPO · reinforcement learning · dataset annotation · Waymo dataset · NAVSIM · end-to-end learning
Authors
Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan
Abstract
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3× fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.
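The difficulty bias the abstract refers to comes from how GRPO computes advantages: rewards within a rollout group are mean-centered and then divided by the group's standard deviation, so groups with high reward variance (typically harder scenarios) have their learning signal damped relative to low-variance groups. Dr. GRPO drops that standard-deviation division. A minimal numerical sketch of the two advantage computations, assuming scalar per-rollout rewards (function names are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO group-relative advantages.

    Dividing by the group std rescales the signal per group:
    high-variance (hard) groups get damped advantages.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def dr_grpo_advantages(rewards):
    """Dr. GRPO advantages: mean-center only.

    Removing the std division keeps the reward scale intact
    across groups, mitigating difficulty bias.
    """
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# Two groups of 4 rollouts: an "easy" group (mostly successes)
# and a "hard" group (mixed outcomes, higher reward variance).
easy = [1.0, 1.0, 1.0, 0.0]
hard = [1.0, 0.0, 1.0, 0.0]

# GRPO: a success in the hard group gets advantage 1.0, while a
# success in the easy group gets ~0.577 — the std normalization
# rescales each group differently.
print(grpo_advantages(easy)[0], grpo_advantages(hard)[0])

# Dr. GRPO: advantages stay on the raw reward scale (0.25 vs 0.5),
# with no per-group rescaling.
print(dr_grpo_advantages(easy)[0], dr_grpo_advantages(hard)[0])
```

The sketch only illustrates the normalization difference; the full Dr. GRPO objective also removes a per-sequence length normalization, and NoRD's rewards come from driving rollouts rather than scalar toy values.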