MedTri: A Platform for Structured Medical Report Normalization to Enhance Vision-Language Pretraining
2026-02-25 • Computer Vision and Pattern Recognition
AI summary
The authors created MedTri, a tool that cleans up medical reports to help computers better learn from medical images like X-rays and CT scans. Instead of using messy, varied text, MedTri turns reports into a simple, consistent format that highlights important body parts and findings. This makes it easier and more accurate for machines to connect images to the right descriptions. They also show that this method helps improve learning quality and can be combined with other techniques to make models more reliable.
Medical Vision-Language Pretraining · Text Normalization · Medical Reports · Anatomical Entity · Radiologic Description · Diagnosis Category · X-ray · Computed Tomography (CT) · Text Augmentation · Image-Grounded Supervision
Authors
Yuetan Chu, Xinhua Ma, Xinran Jin, Gongning Luo, Xin Gao
Abstract
Medical vision-language pretraining increasingly relies on medical reports as large-scale supervisory signals; however, raw reports often exhibit substantial stylistic heterogeneity, variable length, and a considerable amount of image-irrelevant content. Although text normalization is frequently adopted as a preprocessing step in prior work, its design principles and empirical impact on vision-language pretraining have not been systematically examined. In this study, we present MedTri, a deployable normalization framework for medical vision-language pretraining that converts free-text reports into unified [Anatomical Entity: Radiologic Description + Diagnosis Category] triplets. This structured, anatomy-grounded normalization preserves essential morphological and spatial information while removing stylistic noise and image-irrelevant content, providing consistent and image-grounded textual supervision at scale. Across multiple datasets spanning both X-ray and computed tomography (CT) modalities, we demonstrate that structured, anatomy-grounded text normalization is an important factor in medical vision-language pretraining quality, yielding consistent improvements over raw reports and existing normalization baselines. In addition, we illustrate how this normalization can easily support modular text-level augmentation strategies, including knowledge enrichment and anatomy-grounded counterfactual supervision, which provide complementary gains in robustness and generalization without altering the core normalization process. Together, our results position structured text normalization as a critical and generalizable preprocessing component for medical vision-language learning, with MedTri serving as a deployable platform for it. Code and data will be released at https://github.com/Arturia-Pendragon-Iris/MedTri.
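To make the triplet format concrete, the following is a minimal sketch of what the [Anatomical Entity: Radiologic Description + Diagnosis Category] structure could look like as a data type, assuming a simple bracketed serialization; the class name, field names, and the example finding are illustrative assumptions, not the released MedTri API.

```python
from dataclasses import dataclass


@dataclass
class Triplet:
    """One normalized finding in the [Anatomical Entity: Radiologic
    Description + Diagnosis Category] format described in the abstract.
    Names and example values are illustrative, not the released API."""
    anatomical_entity: str
    radiologic_description: str
    diagnosis_category: str

    def render(self) -> str:
        # Serialize back into a single bracketed text unit, which could
        # then serve as the normalized caption for pretraining.
        return (f"[{self.anatomical_entity}: "
                f"{self.radiologic_description} + {self.diagnosis_category}]")


# Hypothetical example: one chest X-ray finding after normalization.
finding = Triplet(
    anatomical_entity="right lower lobe",
    radiologic_description="patchy airspace opacity",
    diagnosis_category="pneumonia",
)
print(finding.render())
# → [right lower lobe: patchy airspace opacity + pneumonia]
```

A report with several findings would simply map to a list of such triplets, one per anatomical entity, which is what gives the normalization its consistent, image-grounded structure.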