MediX-R1: Open-Ended Medical Reinforcement Learning
2026-02-26 • Computer Vision and Pattern Recognition
AI summary
The authors created MediX-R1, a system that teaches medical AI models to give detailed and accurate answers using both text and images, not just multiple-choice answers. They designed a special way to reward the AI for making correct, meaningful responses by checking if answers are right, recognizing medical terms, and ensuring clear explanations. They also introduced a new method to test these AI models using another AI to judge how good the answers are. With a relatively small amount of training data, MediX-R1 performed very well on medical questions, especially those needing open-ended answers, showing promise for better medical AI tools.
Reinforcement Learning · Multimodal Large Language Models · Medical Reasoning · Vision-Language Models · Semantic Reward · LLM-based Evaluation · Open-ended QA · Instruction Tuning · Medical Embeddings · Reference-based Evaluation
Authors
Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal
Abstract
We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with group-based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward that captures paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs, where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves strong results across standard medical LLM (text-only) and VLM (image+text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets, and source code are available at https://medix.cvmbzuai.com
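The composite reward described in the abstract combines four signals. The sketch below illustrates one plausible way to aggregate them as a weighted sum; the weights, signal ranges, and the `composite_reward` function itself are illustrative assumptions, as the abstract only names the components without specifying how they are combined.

```python
# Hypothetical sketch of a composite reward in the spirit of MediX-R1.
# The weights and exact aggregation are assumptions, not the paper's values.

def composite_reward(judge_says_yes: bool,
                     embedding_similarity: float,
                     format_ok: bool,
                     modality_ok: bool,
                     weights: tuple = (0.5, 0.3, 0.1, 0.1)) -> float:
    """Combine the four reward signals named in the abstract:
    - LLM-as-judge accuracy (strict YES/NO -> 1.0 or 0.0)
    - medical-embedding semantic similarity (e.g. cosine, clipped to [0, 1])
    - lightweight format reward (interpretable reasoning present)
    - modality-recognition reward (correct imaging modality identified)
    """
    accuracy = 1.0 if judge_says_yes else 0.0
    semantic = min(max(embedding_similarity, 0.0), 1.0)  # clip to [0, 1]
    fmt = 1.0 if format_ok else 0.0
    modality = 1.0 if modality_ok else 0.0
    w_acc, w_sem, w_fmt, w_mod = weights
    return w_acc * accuracy + w_sem * semantic + w_fmt * fmt + w_mod * modality

# Example: correct answer, high paraphrase similarity, well-formatted
# reasoning, modality identified -> reward close to 1.0.
print(composite_reward(True, 0.92, True, True))
```

Weighting the strict YES/NO accuracy signal most heavily while letting the continuous embedding similarity provide gradient-friendly partial credit is one natural design choice for stabilizing RL on open-ended outputs, where a binary judge alone would give sparse feedback.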