S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

2026-06-18 · Computer Vision and Pattern Recognition

AI summary

The authors present S-Agent, a system that understands and reasons about 3D environments by combining information across multiple views and over time, rather than from single images in isolation. S-Agent uses a vision-language model as a planner that decides what evidence to gather, and a hierarchy of tools that locate objects in 2D, lift them into 3D, and aggregate the result into scene-level knowledge such as counts, measurements, orientation, and relative position. Two memory components, one for the evolving scene state and one for the reasoning context, carry evidence across frames and reasoning steps. Experiments on multi-view and video spatial reasoning benchmarks show that S-Agent improves existing open-source and closed-source vision-language models without retraining. With supervised fine-tuning on S-Agent-generated trajectories, the resulting S-Agent-8B surpasses similar-scale baselines and performs comparably to advanced closed-source models.

Spatial reasoning, Vision-language models (VLMs), Multi-view images, 3D scene understanding, Temporal memory, Semantic planning, Spatio-temporal evidence accumulation, Agent memory, Tool-augmented agents, Supervised fine-tuning
Authors
Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu
Abstract
Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories (S-300K) yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
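
To make the idea of spatio-temporal evidence accumulation concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a pinhole camera whose axes stay aligned with the world (no rotation), per-detection depth in metres, and a simple distance threshold for deciding that two observations are the same object. Every name here (`Camera`, `Detection2D`, `lift_to_3d`, `SceneMemory`, `AgentMemory`, `count_objects`, `merge_radius`) is hypothetical and does not come from the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Camera:
    """Pinhole intrinsics plus camera position in world coordinates.
    Assumption: camera axes are aligned with world axes (no rotation)."""
    fx: float
    fy: float
    cx: float
    cy: float
    position: tuple = (0.0, 0.0, 0.0)


@dataclass
class Detection2D:
    """A 2D grounding result: label, pixel centre, and depth along camera z."""
    label: str
    u: float
    v: float
    depth: float


def lift_to_3d(det, cam):
    """Back-project a pixel with known depth into world coordinates."""
    x = (det.u - cam.cx) * det.depth / cam.fx
    y = (det.v - cam.cy) * det.depth / cam.fy
    z = det.depth
    px, py, pz = cam.position
    return (x + px, y + py, z + pz)


@dataclass
class SceneMemory:
    """Hypothetical scene state: one 3D point per distinct object."""
    merge_radius: float = 0.3  # assumed threshold in metres
    objects: list = field(default_factory=list)

    def add(self, label, point):
        """Store the point unless a same-label object is already nearby.
        Returns True when a new object was added."""
        for known_label, known_point in self.objects:
            if known_label == label and math.dist(known_point, point) <= self.merge_radius:
                return False
        self.objects.append((label, point))
        return True

    def count(self, label):
        return sum(1 for known_label, _ in self.objects if known_label == label)


@dataclass
class AgentMemory:
    """Hypothetical reasoning context: a log of what each step concluded."""
    steps: list = field(default_factory=list)

    def log(self, note):
        self.steps.append(note)


def count_objects(label, frames, scene, agent):
    """Count distinct objects across frames instead of summing per-frame counts."""
    for index, (cam, detections) in enumerate(frames):
        for det in detections:
            if det.label != label:
                continue
            is_new = scene.add(det.label, lift_to_3d(det, cam))
            agent.log(f"frame {index}: {det.label} {'new' if is_new else 'seen before'}")
    return scene.count(label)


# Two views of one room: the first chair is visible in both frames.
cam_a = Camera(500.0, 500.0, 320.0, 240.0, (0.0, 0.0, 0.0))
cam_b = Camera(500.0, 500.0, 320.0, 240.0, (1.0, 0.0, 0.0))
frames = [
    (cam_a, [Detection2D("chair", 320.0, 240.0, 2.0)]),
    (cam_b, [Detection2D("chair", 70.0, 240.0, 2.0),
             Detection2D("chair", 320.0, 240.0, 3.0)]),
]
scene, agent = SceneMemory(), AgentMemory()
total = count_objects("chair", frames, scene, agent)
print(total)  # 2 distinct chairs, although 3 detections were made
```

Summing per-frame detections here would give three chairs. Merging observations in a shared 3D frame gives two, which is the kind of scene-centric answer the abstract contrasts with frame-centric recognition. In a real system the planner, detectors, depth and pose estimates, and the association rule would all be considerably more involved than this sketch.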