SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

2026-05-22, Artificial Intelligence

AI summary

The authors studied how well vision-language models (VLMs) understand numbers related to space, like coordinates or measurements, when these models interact with their environment. They created tests to see if models can connect spatial layouts and numerical information in both directions. Their findings show that current VLMs mostly guess numbers without truly understanding spatial relationships and rely on simple visual hints instead of deeper spatial reasoning. While fine-tuning helps a bit, these models still struggle to grasp structured spatial concepts from images. The authors suggest that improving numerical grounding in space remains a challenge for VLMs.

Vision-Language Models, Spatial numerical understanding, Embodied environments, Num2Space, Space2Num, Spatial reasoning, Coordinate-aware representations, Visual grounding, Fine-tuning, Dynamic transitions
Authors
Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, Han Liu
Abstract
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. In this work, we therefore revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations, and we use them to systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.