Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

2026-04-09

Robotics
AI summary

The authors review the field of Aerial Vision-and-Language Navigation (Aerial VLN), where drones use language instructions to navigate 3D environments. They categorize existing methods into five types based on how these systems are built and how they process instructions and perception. The paper also examines current tools and benchmarks used to test these systems, pointing out limitations like lack of real-world testing and diverse environments. Finally, the authors highlight seven open challenges for future research, such as improving navigation over long instructions and deploying these methods on actual drones.

Aerial Vision-and-Language Navigation, Unmanned Aerial Vehicles (UAVs), Large Language Models (LLMs), Vision-Language Models (VLMs), Sequence-to-Sequence Models, Hierarchical Methods, Simulation-to-Reality Gap, 6-DoF Navigation, Multi-Agent Systems, Benchmark Datasets
Authors
Xingyu Xia, Lekai Zhou, Yujie Tang, Xiaozhou Zhu, Hai Zhu, Wen Yao
Abstract
Aerial vision-and-language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three-dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision-language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms, single-instruction and dialog-based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods. For each category, we systematically analyze design rationales, technical trade-offs, and reported performance. We critically assess the evaluation infrastructure for Aerial VLN, including datasets, simulation platforms, and metrics, and identify gaps in scale, environmental diversity, real-world grounding, and metric coverage. We consolidate cross-method comparisons on shared benchmarks and analyze key architectural trade-offs, including discrete versus continuous actions, end-to-end versus hierarchical designs, and the simulation-to-reality gap. Finally, we synthesize seven concrete open problems (long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation), with specific research directions grounded in the evidence presented throughout the survey.