How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

2026-06-24

Computer Vision and Pattern Recognition, Computation and Language
AI summary

The authors studied how well vision-language models recognize and reason over text in images that have been visually degraded. They built a benchmark called OCR-Robust covering several kinds of text images, including documents, handwriting, and charts, and applied five types of visual perturbation at three severity levels. They tested 18 models and found that high accuracy on clean images does not guarantee robustness on perturbed ones. Models also degrade more on structured visuals such as charts and tables than on simpler document-like images.

Vision-Language Models, OCR (Optical Character Recognition), Visual Perturbations, Robustness Benchmark, Clean Accuracy, Relative Corruption Retention, Charts and Tables, Text Recognition, Structural Distortion, Language Models
Authors
Yuxing Cheng, Yuan Wu, Yi Chang
Abstract
Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and are increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently understood. This gap is critical for OCR reasoning, where visual corruption can induce OCR errors and structural distortions, thereby introducing uncertainty into the reasoning task. To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluating OCR reasoning robustness under visual perturbations. It contains 812 samples across two complementary subsets: OCR1.0, covering documents, scene text, receipts, handwriting, and mathematical content, and OCR2.0, focusing on charts, geometry diagrams, and tables. To enable efficient yet informative evaluation, we conduct a pilot study over 18 candidate perturbations and select 5 representative types, each at 3 severity levels, based on their impact and cross-model discriminability. We evaluate robustness using clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI), and benchmark 18 models spanning proprietary systems, open-source VLMs, and OCR+LLM pipelines. Our results show that higher clean accuracy does not necessarily imply stronger robustness, that models can suffer pronounced worst-case degradation on structure-sensitive OCR tasks, and that charts and tables are substantially more fragile than document-like inputs under perturbation.
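
The abstract names the retention metrics but does not give their formulas. As a rough illustration only, the sketch below assumes one plausible reading: RCR is the mean ratio of corrupted to clean accuracy over all perturbation conditions, and WCR is the minimum of that ratio. The function name `retention_metrics`, the condition labels, and the accuracy values are all hypothetical and are not taken from the paper. CRI is left out because the abstract does not say how it combines the other metrics.

```python
from statistics import mean


def retention_metrics(clean_acc, corrupted_acc):
    """Compute assumed retention metrics for one model.

    clean_acc: accuracy on unperturbed images, in (0, 1].
    corrupted_acc: dict mapping (perturbation, severity) -> accuracy.

    ASSUMPTION: RCR = mean(corrupted / clean), WCR = min(corrupted / clean).
    These definitions are a guess at the metrics named in the abstract.
    """
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    if not corrupted_acc:
        raise ValueError("need at least one perturbation condition")

    # Fraction of clean accuracy retained under each condition.
    ratios = {cond: acc / clean_acc for cond, acc in corrupted_acc.items()}
    worst = min(ratios, key=ratios.get)
    return {
        "rcr": mean(ratios.values()),
        "wcr": ratios[worst],
        "worst_condition": worst,
    }


# Hypothetical accuracies for two placeholder perturbations at 3 severities.
example = {
    ("p1", 1): 0.76, ("p1", 2): 0.70, ("p1", 3): 0.60,
    ("p2", 1): 0.72, ("p2", 2): 0.56, ("p2", 3): 0.40,
}
result = retention_metrics(0.80, example)
print(result)
```

Under these assumed definitions, two models with the same RCR can have very different WCR values, which is one way a model with high clean accuracy could still rank poorly in the worst case.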