MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits

2026-04-10, Computation and Language

Computation and Language, Information Retrieval
AI summary

The authors address document question answering over page images, where retrieval typically keeps only a few pages and can miss useful information. They introduce MAB-DQA, which decomposes a question into aspect-specific subqueries and retrieves candidate pages for each one. Using a strategy inspired by the multi-armed bandit problem, the system estimates which aspects of the question matter most and allocates more of the retrieval budget to them. This helps it gather relevant information spread across multiple pages and improves performance on four benchmarks.

Document Question Answering, Multimodal Retrieval-Augmented Generation, Visual Document Understanding, Multi-Armed Bandit, Aspect-aware Subqueries, Exploration-Exploitation, Information Retrieval, Page Images, Document Layout Analysis, Question Decomposition
Authors
Yixin Xiang, Yunshan Ma, Xiaoyu Du, Yibing Chen, Yanxin Zhang, Jinhui Tang
Abstract
Document Question Answering (DQA) involves generating answers from a document based on a user's query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval-Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top-4), causing informative but less visually salient content to be overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates retrieval budgets toward high-value aspects. With the most informative pages and their correlations, MAB-DQA generates the expected results. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code at https://github.com/ElephantOH/MAB-DQA.
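
To make the budget reallocation idea concrete, the following is a minimal sketch of how subqueries could be treated as bandit arms. It is not the authors' implementation: the abstract does not name the exploration-exploitation policy, so UCB1 is used here purely as an assumed stand-in, and `allocate_budget_ucb`, `candidates`, `reward_fn`, and the constant `c` are hypothetical names introduced for illustration. `reward_fn` is a placeholder for the reward signal that MAB-DQA derives from preliminary reasoning over representative pages.

```python
import math
from typing import Callable, Dict, List


def allocate_budget_ucb(
    candidates: Dict[str, List[int]],
    reward_fn: Callable[[str, int], float],
    budget: int,
    c: float = 1.0,
) -> List[int]:
    """Select up to `budget` pages, treating each subquery as a bandit arm.

    candidates: ranked page ids per subquery (hypothetical retrieval output).
    reward_fn:  placeholder reward in [0, 1] for a (subquery, page) pair.
    Policy:     UCB1, an assumption; the source does not specify the policy.
    """
    queues = {arm: list(pages) for arm, pages in candidates.items()}
    pulls = {arm: 0 for arm in candidates}
    total_reward = {arm: 0.0 for arm in candidates}
    selected: List[int] = []
    seen = set()

    def pull(arm: str) -> bool:
        """Take the next unseen page of `arm`; return False if none is left."""
        while queues[arm]:
            page = queues[arm].pop(0)
            if page in seen:
                continue  # page already retrieved through another subquery
            seen.add(page)
            selected.append(page)
            pulls[arm] += 1
            total_reward[arm] += reward_fn(arm, page)
            return True
        return False

    while len(selected) < budget:
        active = [arm for arm in queues if queues[arm]]
        if not active:
            break  # every candidate list is exhausted
        unpulled = [arm for arm in active if pulls[arm] == 0]
        if unpulled:
            arm = unpulled[0]  # try each subquery once before comparing
        else:
            t = sum(pulls.values())
            # UCB1 score: mean reward plus an exploration bonus
            arm = max(
                active,
                key=lambda a: total_reward[a] / pulls[a]
                + c * math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        pull(arm)
    return selected
```

In this sketch, a subquery whose pages keep earning high rewards receives more of the page budget, while the exploration bonus ensures that rarely tried subqueries are still sampled occasionally. A larger `c` shifts the balance toward exploration.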