BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

2026-03-06 · Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Robotics
AI summary

The authors explore how to improve autonomous driving by combining Large Language Models (LLMs), which excel at reasoning, with Bird's-Eye View (BEV) representations, which provide spatially consistent 3D information. They point out that previous methods process multi-view and multi-frame images independently, causing redundant computation and weaker 3D understanding. Their proposed system, BEVLM, connects a spatially consistent BEV representation with LLMs, enabling the models to reason more effectively about cross-view driving scenes and improving accuracy by 46%. Additionally, by distilling semantic knowledge from the LLM back into the BEV representation, BEVLM improves closed-loop driving performance by 29% in safety-critical scenarios.

Large Language Models, Autonomous Driving, Bird's-Eye View (BEV), 3D Spatial Reasoning, Semantic Understanding, Multi-view Imaging, End-to-End Driving, Geometric Coherence, Knowledge Distillation
Authors
Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding
Abstract
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
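The abstract describes two mechanisms: flattening BEV features into a unified token sequence for the LLM, and distilling semantic knowledge back into the BEV representation. The sketch below illustrates both ideas in minimal NumPy form; all shapes, the linear projector, and the cosine-similarity distillation loss are illustrative assumptions, not the paper's actual architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): a 50x50 BEV grid with
# 64 channels, projected into a 256-dim LLM token embedding space.
H, W, C_bev, D_llm = 50, 50, 64, 256

bev = rng.standard_normal((H, W, C_bev))             # BEV feature map
W_proj = rng.standard_normal((C_bev, D_llm)) * 0.01  # learned projector (random stand-in)

# 1) BEV-as-tokens: flatten the spatial grid into one token sequence,
#    so the LLM sees a single spatially consistent input instead of
#    separate per-view image tokens.
tokens = bev.reshape(H * W, C_bev) @ W_proj          # shape (2500, 256)

# 2) Semantic distillation: align BEV tokens with teacher features
#    (e.g. from a semantically rich foundation encoder) via a
#    cosine-similarity loss -- one common distillation choice; the
#    paper's exact loss may differ.
teacher = rng.standard_normal((H * W, D_llm))

def cosine_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    # Mean of (1 - cosine similarity) per token; lies in [0, 2].
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

loss = cosine_distill_loss(tokens, teacher)
```

Minimizing such a loss pulls the geometric BEV features toward the teacher's semantic space, while the token sequence gives the LLM a single cross-view input, matching the two directions of transfer the abstract claims.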