Learning Situated Awareness in the Real World
2026-02-18 • Computer Vision and Pattern Recognition
AI summary
The authors created SAW-Bench, a new benchmark that tests how well AI models understand the world from a first-person view, using real videos recorded with smart glasses. Unlike previous tests that focus on how objects in a scene relate to each other, this one checks whether models can reason from the viewpoint and movements of the person wearing the glasses. The results show that current models still fall well short of humans, largely because they struggle to infer the camera's position and the spatial layout around it. The authors suggest the benchmark can help improve AI's ability to reason about the world as experienced by a person.
situated awareness, egocentric video, multimodal foundation models, observer-centric relationships, spatial reasoning, camera geometry, Ray-Ban Meta glasses, question-answering benchmark, spatial intelligence, Gemini 3 Flash
Authors
Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang
Abstract
A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses, spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding through six awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.
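
To make the evaluation setup concrete, below is a minimal sketch of how accuracy and a human-model gap could be computed over a QA-style benchmark such as this one. The JSON layout, field names (`task`, `question`, `choices`, `answer`), and the `model_answer` callable are hypothetical illustrations under stated assumptions, not the authors' actual data format or pipeline.

```python
import json
from collections import defaultdict


def accuracy(predictions, references):
    """Fraction of questions answered correctly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def evaluate(qa_path, model_answer):
    """Score a model on SAW-Bench-style QA pairs, overall and per task.

    Assumes `qa_path` points to a JSON list of items with hypothetical fields:
      {"video": ..., "task": ..., "question": ..., "choices": [...], "answer": ...}
    `model_answer` is any callable mapping one item to a predicted choice.
    """
    with open(qa_path) as f:
        items = json.load(f)

    preds, refs = [], []
    per_task = defaultdict(lambda: ([], []))
    for item in items:
        pred = model_answer(item)
        preds.append(pred)
        refs.append(item["answer"])
        per_task[item["task"]][0].append(pred)
        per_task[item["task"]][1].append(item["answer"])

    overall = accuracy(preds, refs)
    by_task = {t: accuracy(p, r) for t, (p, r) in per_task.items()}
    return overall, by_task


# With per-population accuracies in hand, the reported human-model gap
# is simply a difference in percentage points:
#   gap = human_accuracy - model_accuracy
```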