Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation
2026-04-29 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Robotics
AI summary
The authors address common failures of robots that navigate unfamiliar places from vision and language instructions, namely drifting off course and stopping too early, by introducing a method called Three-Step Nav. Their system first looks ahead to identify major landmarks and make a rough plan, then checks the current view closely to follow the next sub-goal, and finally reviews the whole path to fix mistakes before stopping. The approach requires no gradient updates or task-specific fine-tuning. It achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE benchmarks.
Vision-and-Language Navigation, Multimodal Large Language Models, Zero-shot Learning, Landmark Recognition, Trajectory Correction, Robot Navigation, R2R-CE Dataset, RxR-CE Dataset, Multimodal Planning
Authors
Wanrong Zheng, Yunhao Ge, Laurent Itti
Abstract
Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol. First, "look forward" extracts global landmarks and sketches a coarse plan. Then, "look now" aligns the current visual observation with the next sub-goal for fine-grained guidance. Finally, "look backward" audits the entire trajectory to correct accumulated drift before stopping. Requiring no gradient updates or task-specific fine-tuning, our planner drops into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets. Our code is available at https://github.com/ZoeyZheng0/3-step-Nav.
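The three-view protocol in the abstract can be read as a control loop around three queries. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function `three_step_navigate`, the `Trajectory` container, the callback signatures, the `"reached"` signal, and the rule that a failed audit returns the index of the sub-goal to revisit are all hypothetical and chosen only for illustration. In a real agent, each `look_*` callback would presumably wrap an MLLM prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """Hypothetical record of what the agent saw and did."""
    observations: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)


def three_step_navigate(
    instruction: str,
    observe: Callable[[], str],
    act: Callable[[str], None],
    look_forward: Callable[[str], List[str]],
    look_now: Callable[[str, str, Trajectory], str],
    look_backward: Callable[[str, List[str], Trajectory], Optional[int]],
    max_steps: int = 50,
) -> Trajectory:
    """Run an assumed forward / now / backward loop for at most max_steps iterations."""
    # Look forward: extract landmarks from the instruction as a coarse plan.
    plan = look_forward(instruction)
    traj = Trajectory()
    goal_idx = 0

    for _ in range(max_steps):
        if goal_idx >= len(plan):
            # Look backward: audit the whole trajectory before stopping.
            # Assumption: the audit returns None when the path is consistent
            # with the instruction, or the index of a sub-goal to revisit.
            redo = look_backward(instruction, plan, traj)
            if redo is None:
                traj.actions.append("stop")
                break
            goal_idx = redo
            continue

        # Look now: align the current view with the next sub-goal.
        obs = observe()
        traj.observations.append(obs)
        action = look_now(plan[goal_idx], obs, traj)
        if action == "reached":
            # Assumption: "reached" marks the sub-goal done without moving.
            goal_idx += 1
        else:
            act(action)
            traj.actions.append(action)

    return traj
```

Passing the three queries in as callbacks keeps the loop free of any model or simulator dependency, which mirrors the abstract's claim that the planner needs no gradient updates and can drop into an existing pipeline. The `max_steps` bound also covers audit iterations, so a repeatedly failing audit cannot loop forever.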