Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

2026-04-10

Machine Learning
AI summary

The authors studied why medical vision-language models (VLMs) sometimes give miscalibrated or unreliable answers, especially when the same question is rephrased. They found that both problems arise mostly near the model's decision boundary, where it is unsure. Benchmarking five uncertainty quantification methods, they showed that predictive entropy computed from a single forward pass flags answers that are likely to be wrong or to flip under rephrasing better than an ensemble does. A five-member LoRA ensemble failed under the shift from MIMIC-CXR to PadChest on MedGemma, though not on LLaVA-RAD, while MC Dropout gave the best calibration and selective prediction. Overall, the authors found that simple methods are the more effective way to identify unreliable predictions.

Medical Vision Language Models, Uncertainty Quantification, Decision Boundary, Predictive Entropy, Calibration, Ensemble Methods, MC Dropout, Cross-Dataset Validation, Out-of-Distribution, Selective Prediction
Authors
Binesh Sadanandan, Vahid Behzadan
Abstract
Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-IT across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-RAD-7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD, p < 10^-4), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC→PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-RAD's ensemble does not collapse (69.1%). MC Dropout achieves the best calibration (ECE 4.3%) and selective prediction (coverage 21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs 0.657) and paraphrase screening. Simple methods win.
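
The two quantities the abstract relies on, predictive entropy from a single forward pass and expected calibration error (ECE), can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a binary yes/no answer whose probability `p_yes` has already been read off the model's output, and it uses the common equal-width-bin form of ECE; the function names, the threshold value, and the bin count are hypothetical choices made for illustration.

```python
import math


def binary_entropy(p_yes: float) -> float:
    """Predictive entropy in bits of a yes/no answer with P(yes) = p_yes.

    Assumption: the answer space is binary. Entropy peaks at p_yes = 0.5,
    i.e. at the decision boundary, and is zero for a fully confident answer.
    """
    if p_yes <= 0.0 or p_yes >= 1.0:
        return 0.0
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))


def flag_uncertain(p_yes_values, threshold):
    """Flag every prediction whose entropy exceeds a single threshold."""
    return [binary_entropy(p) > threshold for p in p_yes_values]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence|.

    confidences: confidence of the predicted answer, each in [0, 1].
    correct: booleans, True where the predicted answer was right.
    Assumption: equal-width bins; other binning schemes exist.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

With a hypothetical threshold of 0.5 bits, `flag_uncertain([0.5, 0.99], 0.5)` flags only the first prediction, the one sitting on the decision boundary. In practice the threshold would be tuned on held-out data.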