RoboAtlas: Contextual Active SLAM

2026-06-24 • Robotics

Robotics, Computer Vision and Pattern Recognition

AI summary

The authors created RoboAtlas, a system that helps robots explore and understand large spaces by combining map-making with object recognition. It balances exploring new areas with using what it already knows about the scene to navigate more efficiently. They tested it in simulation and on a real robot, reporting a 100% task success rate in these evaluations and a 90.6% success rate on the GOAT-Bench "Val Unseen" benchmark. The method also works well with smaller AI models, suggesting that detailed 3D maps contribute more to a robot's understanding of its environment than upgrading the AI model alone.

Active SLAM, 3D semantic mapping, frontier exploration, contextual multi-armed bandit, VLM (Vision-Language Model), Unitree Go2 robot, GPT-4o, semantic reasoning, navigation, robotic exploration

Authors

Alexander Schperberg, Shivam K. Panda, Abraham P. Vinod, M. K. Jawed, Stefano Di Cairano

Abstract

We present RoboAtlas, a contextual Active SLAM framework that adaptively balances geometric exploration and semantic reasoning using a scalable 3D semantic mapping system, OpenRoboVox. RoboAtlas integrates frontier exploration, global semantic-map reasoning, and egocentric VLM-based reasoning through a contextual multi-armed bandit that transitions from exploration to semantically guided navigation as scene understanding improves. We evaluate the system in simulation and on a Unitree Go2 robot in large-scale real-world environments exceeding 1800 m² with approximately 30k mapped semantic instances, achieving a 100% task success rate. On the GOAT-Bench "Val Unseen" benchmark, RoboAtlas achieves state-of-the-art performance with the highest reported success rate (SR) of 90.6% using GPT-4o, improving over the strongest prior baseline by 17.8 percentage points in SR. Using the much smaller Qwen2.5-VL-7B model, it still achieves 88.8% SR, outperforming all baselines that use GPT-4o and revealing the importance of the information gained by our semantic mapping framework over simply replacing the underlying foundation model. The results demonstrate that grounding foundation models with large-scale 3D semantic maps enables robust and efficient contextual Active SLAM.
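To make the bandit idea concrete, the following is a minimal sketch of a contextual multi-armed bandit that picks among the three strategies named in the abstract. It is not the authors' method: the abstract does not state the bandit algorithm, the context features, or the reward. Here we assume a tabular UCB1 learner per context bucket, a single hypothetical context feature (the fraction of the scene already mapped, `coverage`), and a toy deterministic reward in which frontier exploration pays off early and semantic-map reasoning pays off late. The names `BucketedUCB`, `toy_reward`, `run_toy_simulation`, and `most_pulled` are illustrative only.

```python
import math
from collections import defaultdict

# Arm names mirror the three strategies listed in the abstract.
ARMS = ("frontier", "semantic_map", "egocentric_vlm")


class BucketedUCB:
    """Tabular contextual bandit: an independent UCB1 learner per context bucket.

    Assumption: the context is one scalar in [0, 1] (map coverage).
    """

    def __init__(self, arms=ARMS, n_buckets=4, c=1.0):
        self.arms = tuple(arms)
        self.n_buckets = n_buckets
        self.c = c                        # exploration weight
        self.counts = defaultdict(int)    # (bucket, arm) -> number of pulls
        self.means = defaultdict(float)   # (bucket, arm) -> running mean reward

    def bucket(self, coverage):
        # Clamp to [0, 1], then map to a bucket index in [0, n_buckets - 1].
        coverage = min(max(coverage, 0.0), 1.0)
        return min(int(coverage * self.n_buckets), self.n_buckets - 1)

    def select(self, coverage):
        b = self.bucket(coverage)
        # Pull every arm once in this bucket before using confidence bounds.
        for arm in self.arms:
            if self.counts[(b, arm)] == 0:
                return arm
        total = sum(self.counts[(b, a)] for a in self.arms)

        def ucb(arm):
            n = self.counts[(b, arm)]
            bonus = self.c * math.sqrt(2.0 * math.log(total) / n)
            return self.means[(b, arm)] + bonus

        return max(self.arms, key=ucb)

    def update(self, coverage, arm, reward):
        key = (self.bucket(coverage), arm)
        self.counts[key] += 1
        # Incremental update of the running mean.
        self.means[key] += (reward - self.means[key]) / self.counts[key]


def toy_reward(arm, coverage):
    """Hypothetical reward model, chosen only to illustrate the transition."""
    if arm == "frontier":
        return 1.0 - coverage   # exploring helps most when little is mapped
    if arm == "semantic_map":
        return coverage         # map reasoning helps most once the map is rich
    return 0.4                  # egocentric VLM: assumed flat payoff


def run_toy_simulation(rounds_per_context=200):
    bandit = BucketedUCB()
    for coverage in (0.1, 0.9):           # an early and a late stage of mapping
        for _ in range(rounds_per_context):
            arm = bandit.select(coverage)
            bandit.update(coverage, arm, toy_reward(arm, coverage))
    return bandit


def most_pulled(bandit, coverage):
    b = bandit.bucket(coverage)
    return max(bandit.arms, key=lambda a: bandit.counts[(b, a)])


if __name__ == "__main__":
    learned = run_toy_simulation()
    print("low coverage :", most_pulled(learned, 0.1))
    print("high coverage:", most_pulled(learned, 0.9))
```

Under these toy rewards, the learner shifts its preferred arm as coverage grows, which is the qualitative behavior the abstract describes. A real system would likely use a richer context vector and a learned reward signal; bucketing a single scalar is used here only to keep the sketch short and dependency-free.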
