SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

2026-05-26
Computer Vision and Pattern Recognition

AI summary

The authors created SpatialBench, a large and carefully designed benchmark that tests how well spatial foundation models generalize across tasks, scenes, and input types. They found that current models are not yet versatile enough to perform well in every setting. Their tests showed that full-context attention gives the best accuracy, while bounded-memory strategies let models scale to longer sequences. They also found that, on embodied and egocentric tasks, close domain alignment and high data quality matter more than simply adding more data. To help close the largest data gap they identified, they introduced a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next.

spatial foundation models, generalization, benchmark, full-context attention, bounded-memory strategies, domain alignment, dataset scaling, spatial domains, embodied tasks, egocentric tasks
Authors
Haosong Peng, Hao Li, Jiaqi Chen, Yuhao Pan, Runmao Yao, Yalun Dai, Fushuo Huo, Fangzhou Hong, Zhaoxi Chen, Haozhao Wang, Dingwen Zhang, Ziwei Liu, Wenchao Xu
Abstract
While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players capable of generalizing robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this overarching question requires a holistic assessment, yet current models are mainly evaluated on the specific domains for which they were designed or trained. Such evaluations are intrinsically constrained by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, making it fundamentally difficult to assess their true generalization capabilities. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench features unprecedented scale and rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 diverse spatial domains. It comprehensively evaluates 41 models across 6 paradigms on 5 task suites under 4 different input density settings. Our extensive evaluation reveals that current models are not yet all-round players, and uncovers crucial insights for future advancement. Specifically, we demonstrate that full-context attention maximizes accuracy while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations in challenging embodied and egocentric tasks demonstrate that strict domain alignment and high data quality are far more critical to performance than simple dataset scaling. Furthermore, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.