VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

2026-06-10

Computer Vision and Pattern Recognition, Robotics
AI summary

The authors developed VLGA, a vision-language-action model for autonomous driving that learns to reconstruct the dense 3D scene around the vehicle while it learns to drive. Earlier models either add 3D features without training the policy to use them or supervise geometry only with sparse box and map losses; VLGA instead treats geometry as a fourth modality alongside vision, language, and action. It is trained with dense 3D supervision from LiDAR scans and evaluated on the nuScenes (open-loop) and Bench2Drive (closed-loop) driving benchmarks, where it outperforms comparable VLA models in trajectory error, collision rate, and driving score. The results suggest that dense 3D supervision improves decision-making in complex scenes.

Vision-language-action (VLA) models, 3D reconstruction, LiDAR, Per-pixel pointmap regression, nuScenes dataset, Bench2Drive dataset, Open-loop evaluation, Closed-loop evaluation, Ego status, Driving score
Authors
Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman
Abstract
Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA adds geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Experiments on the challenging nuScenes and Bench2Drive datasets, for open-loop and closed-loop evaluation respectively, show that VLGA outperforms counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 error (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains a state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
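
To make "per-pixel pointmap regression against LiDAR" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera, LiDAR points already transformed into the camera frame, and a plain masked L1 loss; the abstract does not specify the actual loss form, coordinate frame, or normalization. The function names `lidar_to_sparse_pointmap` and `masked_pointmap_l1` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
# A pointmap stores one 3D point (x, y, z) per pixel; None marks "no LiDAR return".
SparsePointmap = List[List[Optional[Point]]]
DensePointmap = Sequence[Sequence[Point]]


def lidar_to_sparse_pointmap(
    points_cam: Sequence[Point],
    fx: float, fy: float, cx: float, cy: float,
    height: int, width: int,
) -> SparsePointmap:
    """Project camera-frame LiDAR points into the image (assumed pinhole model).

    When several points land on the same pixel, the nearest one is kept.
    """
    target: SparsePointmap = [[None] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:  # point is behind the camera
            continue
        u = round(fx * x / z + cx)  # pixel column
        v = round(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            current = target[v][u]
            if current is None or z < current[2]:
                target[v][u] = (x, y, z)
    return target


def masked_pointmap_l1(pred: DensePointmap, target: SparsePointmap) -> float:
    """Mean L1 distance between predicted and LiDAR points over supervised pixels.

    Pixels without a LiDAR return contribute nothing (assumed masking scheme).
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t is None:
                continue
            total += sum(abs(a - b) for a, b in zip(p, t))
            count += 1
    return total / count if count else 0.0


# Toy 3x3 image: two valid LiDAR points and one behind the camera.
points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
target = lidar_to_sparse_pointmap(points, fx=10.0, fy=10.0, cx=1.0, cy=1.0,
                                  height=3, width=3)
# A constant prediction that matches the first point exactly.
pred = [[(0.0, 0.0, 10.0)] * 3 for _ in range(3)]
loss = masked_pointmap_l1(pred, target)
```

In this toy case only two pixels receive a LiDAR target, so the loss averages over those two pixels rather than over the whole image. A real LiDAR sweep is dense compared with box or map annotations, but it still leaves many pixels without a return, which is why the sketch masks unsupervised pixels.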