From Web to Pixels: Bringing Agentic Search into Visual Perception

2026-05-12

Computer Vision and Pattern Recognition
AI summary

The authors study a hard version of visual perception where identifying an object requires extra knowledge from the web or external facts, not just the image itself. They create WebEye, a benchmark of images, annotated objects, and questions that require searching for information to find or understand objects. They also develop Pixel-Searcher, an agentic system that searches for evidence and links it to exact object locations or answers. Their results show Pixel-Searcher performs best on this challenging task, though it still struggles with retrieving the right evidence and matching it to the image.

visual perception, open-world recognition, object grounding, benchmark dataset, semantic understanding, multi-hop reasoning, object segmentation, visual question answering, knowledge-based search
Authors
Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue
Abstract
Visual perception connects high-level semantic understanding to pixel-level localization, but most existing settings assume that the decisive evidence for identifying a target is already present in the image or in frozen model knowledge. We study a more practical yet harder open-world case in which a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
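
The page gives only a high-level description of the workflow. As a rough illustration, a search-to-pixel agent loop of the kind the abstract describes might be structured as in the sketch below; this is an assumption, not the authors' implementation, and all callables (describe, web_search, resolve, localize) are hypothetical placeholders.

# Minimal, hypothetical sketch of a search-to-pixel agent loop, loosely
# mirroring the workflow described in the abstract. All callables here
# (describe, web_search, resolve, localize) are illustrative placeholders,
# not the authors' actual interfaces.

from dataclasses import dataclass, field

@dataclass
class Grounding:
    label: str                      # resolved identity of the target object
    box: tuple                      # (x1, y1, x2, y2) pixel coordinates
    evidence: list = field(default_factory=list)  # supporting search snippets

def search_to_pixel(image, query, describe, web_search, resolve, localize,
                    max_rounds=3):
    """Resolve a knowledge-dependent query, then bind the answer to pixels.

    describe(image)                      -> list of visible candidate objects
    web_search(text)                     -> list of evidence snippets
    resolve(query, candidates, evidence) -> identity string or None
    localize(image, identity)            -> bounding box (x1, y1, x2, y2)
    """
    candidates = describe(image)
    evidence, identity = [], None
    for _ in range(max_rounds):
        # Evidence acquisition: query the web with the question plus
        # what is actually visible in the image.
        evidence.extend(web_search(f"{query} {'; '.join(candidates)}"))
        # Identity resolution: commit to a target once evidence suffices.
        identity = resolve(query, candidates, evidence)
        if identity is not None:
            break
    if identity is None:
        return None  # the evidence never resolved the target
    # Visual instance binding: ground the resolved identity in the image.
    return Grounding(label=identity, box=localize(image, identity),
                     evidence=evidence)

The three stages of this sketch deliberately mirror the failure modes the abstract lists: evidence acquisition (web_search), identity resolution (resolve), and visual instance binding (localize).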