SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

2026-03-12

Computation and Language · Artificial Intelligence · Computer Vision and Pattern Recognition

Keywords: multimodal reasoning, dataset construction, scientific QA, foundation models, cross-modal comprehension, claim-centric QA, document-scale reasoning, SciMDR, reasoning chains, benchmark evaluation
Authors
Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan
Abstract
Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs with explicit reasoning from focused document segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly on tasks that require complex document-level reasoning.
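The two-stage structure of the pipeline can be illustrated with a minimal sketch. All names and data here are hypothetical: in the actual framework, stage 1 is performed by a generative model over paper segments, whereas this toy stand-in fabricates one QA pair per segment; only the stage boundary (synthesize on isolated segments, then reground into the full document) mirrors the description above.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    reasoning: str
    segment_id: int  # index of the segment the pair was synthesized from

def synthesize_qa(segments):
    """Stage 1 (Claim-Centric QA Synthesis), toy stand-in.

    Each QA pair is generated against a single focused segment,
    so faithfulness can be checked locally.
    """
    return [
        QAPair(
            question=f"What does segment {i} claim?",
            answer=seg,
            reasoning=f"Stated directly in segment {i}.",
            segment_id=i,
        )
        for i, seg in enumerate(segments)
    ]

def reground(qa_pairs, segments):
    """Stage 2 (Document-Scale Regrounding), toy stand-in.

    Re-embed each isolated QA pair into the full document so that
    answering it requires locating evidence at document scale.
    """
    full_document = "\n".join(segments)
    return [
        {
            "context": full_document,
            "question": qa.question,
            "answer": qa.answer,
            "reasoning": qa.reasoning,
        }
        for qa in qa_pairs
    ]

segments = [
    "The method improves accuracy by 4 points.",
    "Figure 2 shows the ablation results.",
]
examples = reground(synthesize_qa(segments), segments)
print(len(examples))  # one document-scale example per synthesized QA pair
```

The design point the sketch captures is the decoupling: faithfulness is enforced where generation is easy to verify (a single segment), while realism is restored programmatically by expanding the context, rather than asking the generator to be faithful over a full document at once.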