A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature

2026-06-02 · Robotics

Robotics, Machine Learning
AI summary

The authors created a new way for robots or virtual agents to understand city navigation by focusing on the empty spaces they can move through, rather than just how buildings look. They represent these spaces as a 3D map showing how far the nearest surface is in all directions, called an isovist. Their model predicts what the agent will see next based on past measurements and movements, maintaining important spatial details and consistency across different places. Interestingly, when trained on two cities, their model could distinguish between them based on navigation dynamics alone, without relying on visual appearance. This approach provides a clear and lightweight way for machines to reason about space and movement in urban environments.

Embodied agents, World models, Isovist, 3D navigation, Occupancy grids, Spatial representation, Depth residual, Scheduled sampling, Latent space, Urban navigation
Authors
Xuhui Lin, Stephen Law, Nanjiang Chen, Kunyao Li, Tao Yang
Abstract
Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's-eye-view occupancy grids, flatten the three-dimensional environment onto a ground plane, discarding the above-ground and multi-level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility-depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self-rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.
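
To make the 3D isovist and the depth-residual formulation concrete, here is a minimal illustrative sketch in Python with NumPy. It is not the authors' pipeline. It assumes a boolean voxel occupancy grid with unit-sized voxels, an equirectangular sampling of directions, fixed-step ray marching, and a residual taken relative to the most recent isovist. The function names (`sphere_directions`, `isovist_3d`, `residual_target`, `apply_residual`) and the parameters `max_range` and `step` are hypothetical.

```python
import numpy as np

def sphere_directions(n_elev=17, n_azim=32):
    """Unit direction vectors on an (elevation x azimuth) grid (assumed layout)."""
    elev = np.linspace(-np.pi / 2, np.pi / 2, n_elev)
    azim = np.linspace(-np.pi, np.pi, n_azim, endpoint=False)
    e, a = np.meshgrid(elev, azim, indexing="ij")
    return np.stack(
        [np.cos(e) * np.cos(a), np.cos(e) * np.sin(a), np.sin(e)], axis=-1
    )

def isovist_3d(occ, origin, dirs, max_range=50.0, step=0.25):
    """Spherical visibility-depth map: distance to the nearest occupied voxel
    along each direction, found by fixed-step ray marching.
    Rays that hit nothing keep the value max_range."""
    depth = np.full(dirs.shape[:-1], max_range)
    hit = np.zeros(dirs.shape[:-1], dtype=bool)
    shape = np.array(occ.shape)
    for t in np.arange(step, max_range, step):
        idx = np.floor(origin + t * dirs).astype(int)   # voxel index per ray
        inside = np.all((idx >= 0) & (idx < shape), axis=-1)
        solid = np.zeros_like(hit)
        ii = idx[inside]
        solid[inside] = occ[ii[:, 0], ii[:, 1], ii[:, 2]]
        first = solid & ~hit                             # first contact only
        depth[first] = t
        hit |= first
    return depth

def residual_target(prev_depth, next_depth):
    """Training target under the assumed residual form: next minus previous."""
    return next_depth - prev_depth

def apply_residual(prev_depth, predicted_residual, max_range=50.0):
    """Reconstruct the next isovist from the previous one plus a residual."""
    return np.clip(prev_depth + predicted_residual, 0.0, max_range)

# Toy scene: a solid slab occupying x >= 15 in a 20^3 grid.
occ = np.zeros((20, 20, 20), dtype=bool)
occ[15:, :, :] = True
dirs = sphere_directions()

before = isovist_3d(occ, np.array([5.5, 10.5, 10.5]), dirs)
after = isovist_3d(occ, np.array([6.5, 10.5, 10.5]), dirs)  # one step along +x

residual = residual_target(before, after)
reconstructed = apply_residual(before, residual)
print("max reconstruction error:", np.abs(reconstructed - after).max())
```

In this toy scene the residual is zero for every ray that misses the slab in both frames and nonzero only for rays that hit it in at least one frame. This illustrates why predicting a residual on top of the previous depth map lets depth discontinuities at surface edges carry over from the input rather than being regenerated from scratch.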