Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

2026-06-17 · Computation and Language

Computation and Language, Artificial Intelligence
AI summary

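The authors studied how to adapt large language models to the medical field using French medical question-answering tasks. They compared continual pretraining, supervised fine-tuning, and their combination across several model families and sizes, on both multiple-choice and open-ended questions. For multiple-choice questions, fine-tuning alone is usually a strong and cost-effective choice, because adding continual pretraining brings only small gains that are often not statistically significant. For open-ended answers, continual pretraining improves overlap-based scores, fine-tuning alone often hurts generation quality, and LLM-based judges prefer instruction-tuned models and the combined approach. Their work also showed that adapting models in French can help with English tasks. They provide practical advice for choosing adaptation methods when computing power is limited.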

large language models, domain adaptation, medical question answering, fine-tuning, pretraining, multiple-choice QA, open-ended QA, cross-lingual transfer, automatic evaluation, instruction tuning
Authors
Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre
Abstract
The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.
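The abstract notes that MCQA gains of CPT+SFT over SFT are frequently not statistically significant, but it does not say which test was used. As an illustration only, the sketch below shows one common way to check such a claim: a paired bootstrap over per-question correctness flags. The function name `paired_bootstrap_pvalue`, its parameters, and the toy labels are assumptions made for this example and are not taken from the paper.

```python
import random


def paired_bootstrap_pvalue(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap comparison of two systems scored on the same questions.

    correct_a, correct_b: lists of 0/1 flags, one per question, in the same order
    (for example, system A = SFT and system B = CPT+SFT on an MCQA test set).
    Returns (observed accuracy gain of B over A, approximate one-sided p-value),
    where the p-value is the share of resampled test sets on which B does not
    beat A.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long score lists")

    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    n = len(correct_a)

    # Per-question difference: +1 if only B is right, -1 if only A is right, 0 otherwise.
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    observed_gain = sum(diffs) / n

    not_better = 0
    for _ in range(n_resamples):
        # Resample n questions with replacement, keeping each A/B pair together.
        sample_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        if sample_sum <= 0:
            not_better += 1

    return observed_gain, not_better / n_resamples


# Toy, made-up correctness flags (NOT results from the paper).
sft_flags = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
cpt_sft_flags = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
gain, p_value = paired_bootstrap_pvalue(sft_flags, cpt_sft_flags)
print(f"accuracy gain: {gain:.2f}, approximate p-value: {p_value:.3f}")
```

With a test like this, a small accuracy gain on a modest number of questions can easily yield a large p-value, which matches the abstract's point that a higher CPT+SFT score does not by itself justify the extra pretraining cost.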