VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

2026-06-02 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
AI summary

The authors point out that current AI models struggle with complex tasks that require looking carefully across different parts of an image and connecting multiple clues to answer questions. To test these abilities better, they created VistaHop, a new set of images and question tasks that need multi-step reasoning and checking visual details repeatedly. They also made VistaArena, a testing environment that helps models use different tools like image cropping and searching while answering. Their experiments show that even the best models perform poorly on these tasks, revealing ongoing challenges in careful visual understanding and reasoning. This work highlights the need for better tests and training to improve AI's deep image reasoning skills.

Visual DeepSearch, Multimodal Large Reasoning Model (MLRM), Multi-hop Visual Reasoning, Visual Anchors, Iterative Image Inspection, Evidence Integration, Benchmark Dataset, VistaHop, VistaArena, Pass@1
Authors
Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin
Abstract
Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.
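The headline number above is a Pass@1 score. The abstract does not say how VistaArena aggregates attempts, so the following is only a minimal sketch of the standard Pass@1 estimate: for each task, take the fraction of sampled attempts that were judged correct, then average over tasks. The function name `pass_at_1` and the input layout (one list of booleans per task) are assumptions made for illustration, not part of the benchmark's code.

```python
from typing import Sequence


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Estimate Pass@1 from per-task attempt outcomes.

    `results[i]` holds one boolean per sampled attempt on task i
    (True = the answer was judged correct). This layout is a
    hypothetical convention, not VistaArena's actual format.
    """
    if not results:
        raise ValueError("results must contain at least one task")

    per_task = []
    for attempts in results:
        if not attempts:
            raise ValueError("every task needs at least one attempt")
        # With n attempts and c correct, the unbiased pass@1
        # estimate for a single task is c / n.
        per_task.append(sum(attempts) / len(attempts))

    # Benchmark-level score: unweighted mean over tasks.
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Three hypothetical tasks: solved, failed, and solved in 1 of 2 attempts.
    score = pass_at_1([[True], [False], [False, True]])
    print(f"Pass@1 = {score:.2%}")
```

With a single attempt per task, this reduces to plain accuracy: the share of tasks answered correctly on the first try.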