Fix Before Search: Benchmarking Agentic Query Visual Pre-processing in Multimodal Retrieval-augmented Generation

2026-02-13 · Information Retrieval

AI summary

The authors study how to make visual search work better when query images are imperfect, for example blurry or distorted. They found that current methods often fail because they assume images are always of good quality. To address this, the authors created V-QPP-Bench, a benchmark that tests whether an AI model can decide when and how to fix such visual problems before searching. Their experiments show that fixing images first can substantially improve retrieval and end-to-end performance, but that off-the-shelf models need specialized training to do this well. They also found that smaller, fine-tuned models can match or exceed larger proprietary ones.

Multimodal Retrieval-Augmented Generation, Visual Query Pre-processing, Multimodal Large Language Models, Image Quality Degradation, Geometric Distortions, Retrieval Recall, Fine-tuning, Agentic Decision-making, Benchmark, Perceptual Tools
Authors
Jiankun Zhang, Shenglai Zeng, Kai Guo, Xinnan Dai, Hui Liu, Jiliang Tang, Yi Chang
Abstract
Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a key paradigm for grounding MLLMs with external knowledge. While query pre-processing (e.g., rewriting) is standard in text-based RAG, existing MRAG pipelines predominantly treat visual inputs as static and immutable, implicitly assuming they are noise-free. However, real-world visual queries are often "imperfect", suffering from geometric distortions, quality degradation, or semantic ambiguity, which leads to catastrophic retrieval failures. To address this gap, we propose V-QPP-Bench, the first comprehensive benchmark dedicated to Visual Query Pre-processing (V-QPP). We formulate V-QPP as an agentic decision-making task where MLLMs must autonomously diagnose imperfections and deploy perceptual tools to refine queries. Our extensive evaluation across 46,700 imperfect queries and diverse MRAG paradigms reveals three critical insights: (1) Vulnerability: visual imperfections severely degrade both retrieval recall and end-to-end MRAG performance; (2) Restoration Potential & Bottleneck: while oracle preprocessing recovers near-perfect performance, off-the-shelf MLLMs struggle with tool selection and parameter prediction without specialized training; and (3) Training Enhancement: supervised fine-tuning enables compact models to achieve comparable or superior performance to larger proprietary models, demonstrating the benchmark's value for developing robust MRAG systems. The code is available at https://github.com/phycholosogy/VQQP_Bench
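
To make the "diagnose, then deploy a tool" formulation concrete, the following is a minimal illustrative sketch, not the benchmark's actual interface. It assumes a toy setting in which an image is a small grid of integers, the only perceptual tools are a quarter-turn rotation and a horizontal flip, and the agent's output is a dictionary naming a tool and its parameters. All names here (`TOOLS`, `apply_decision`, the decision format) are hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, Dict, List

# Toy stand-in for pixel data (assumption: real systems operate on actual images).
Image = List[List[int]]


def rotate_quarter_turns(img: Image, turns: int) -> Image:
    """Rotate the grid clockwise by `turns` * 90 degrees."""
    for _ in range(turns % 4):
        # Reverse the row order, then transpose: one clockwise quarter turn.
        img = [list(row) for row in zip(*img[::-1])]
    return img


def flip_horizontal(img: Image) -> Image:
    """Mirror the grid left to right."""
    return [row[::-1] for row in img]


# Hypothetical tool registry; the benchmark's real tool set may differ.
TOOLS: Dict[str, Callable[..., Image]] = {
    "rotate": rotate_quarter_turns,
    "flip": flip_horizontal,
}


def apply_decision(img: Image, decision: dict) -> Image:
    """Apply an agent decision of the form {"tool": name, "params": {...}}.

    A tool name of "none" means the query is judged clean and is passed through.
    """
    if decision["tool"] == "none":
        return img
    tool = TOOLS[decision["tool"]]
    return tool(img, **decision.get("params", {}))


# A clean query and a geometrically distorted version of it.
clean = [[1, 2], [3, 4]]
distorted = rotate_quarter_turns(clean, 1)

# An oracle knows the distortion and therefore picks the exact inverse.
oracle_decision = {"tool": "rotate", "params": {"turns": 3}}
restored = apply_decision(distorted, oracle_decision)
```

In this toy setting the two failure modes named in the abstract are easy to see: choosing `"flip"` instead of `"rotate"` is a tool-selection error, while choosing `"rotate"` with the wrong `turns` value is a parameter-prediction error. Either one leaves the query unrestored before retrieval.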