How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

2026-04-09 · Artificial Intelligence
AI summary

The authors studied whether large multimodal models (LMMs), which process both images and language, can act the way humans do when navigating complex 3D urban spaces. They built a dataset of 5,037 goal-oriented navigation samples that emphasizes vertical 3D actions and urban semantic detail, then evaluated the navigation ability of 17 models. Their results show that these models can take some useful actions but still fall well short of human performance, in part because a single wrong choice at a critical decision point can send the agent rapidly away from the destination. The authors also explored four ways to improve the models: geometric perception, cross-view understanding, spatial imagination, and long-term memory.

Large Multimodal Models, Spatial Decision-Making, Embodied Navigation, 3D Urban Spaces, Goal-Oriented Navigation, Visual-Linguistic Reasoning, Geometric Perception, Spatial Imagination, Long-Term Memory, Navigation Error Bifurcation
Authors
Baining Zhao, Ziyou Wang, Jianjie Fang, Zile Zhou, Yanggang Xu, Yatai Ji, Jiacheng Xu, Qian Zhang, Weichen Zhang, Chen Gao, Xinlei Chen
Abstract
Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like humans through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic information. Then, we comprehensively assess 17 representative models, including non-reasoning LMMs, reasoning LMMs, agent-based methods, and vision-language-action models. Experiments show that current LMMs exhibit emerging action capabilities, yet remain far from human-level performance. Furthermore, we reveal an intriguing phenomenon: navigation errors do not accumulate linearly but instead diverge rapidly from the destination after a critical decision bifurcation. We investigate the limitations of LMMs by analyzing their behavior at these critical decision bifurcations. Finally, we experimentally explore four promising directions for improvement: geometric perception, cross-view understanding, spatial imagination, and long-term memory. The project is available at: https://github.com/serenditipy-AC/Embodied-Navigation-Bench.
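To make the idea of a decision bifurcation concrete, the following is a minimal sketch of one way such a point could be located in a single episode. It is an illustration under stated assumptions, not the authors' definition or code: the function name `find_bifurcation_step`, the `window` parameter, and the distance trace are all hypothetical. The heuristic simply looks for the first step after which the distance to the goal rises for several consecutive steps.

```python
from typing import Optional, Sequence


def find_bifurcation_step(
    distances: Sequence[float], window: int = 3
) -> Optional[int]:
    """Return the first step index after which the distance to the goal
    strictly increases for `window` consecutive steps, or None if the
    trace never diverges in that sense.

    This is a hypothetical heuristic, not the paper's criterion.
    """
    for i in range(len(distances) - window):
        # Take the step i plus the next `window` steps.
        segment = distances[i : i + window + 1]
        # Check that every consecutive pair in the segment is increasing.
        if all(later > earlier for earlier, later in zip(segment, segment[1:])):
            return i
    return None


# Hypothetical distance-to-goal trace (in meters) for one navigation episode:
# the agent approaches the goal, then drifts away after step 3.
trace = [50.0, 42.0, 35.0, 30.0, 33.0, 41.0, 55.0, 72.0]
print(find_bifurcation_step(trace))  # prints 3
```

In this toy trace the agent closes in on the goal for the first three moves and then moves steadily away, so step 3 is flagged as the candidate bifurcation. A linear error-accumulation pattern would instead show a slow, roughly constant drift with no single turning point.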