Counterfactual Stress Testing for Image Classification Models

2026-05-11

Computer Vision and Pattern Recognition
AI summary

The authors highlight that deep learning models in medical imaging can struggle when used in new clinical settings because of differences such as patient populations or scanner types. They point out that traditional testing methods rely on simple image tweaks that do not realistically reflect these differences. To address this, the authors developed a testing approach based on causal generative models that create realistic 'what if' images by changing specific factors, such as scanner type or patient sex, without altering the anatomy. Their experiments show that this method predicts real-world model performance better than older perturbation-based methods, and could therefore help evaluate medical AI more reliably before it is used in clinics.

deep learning, medical imaging, distribution shift, underspecification, stress testing, causal generative models, counterfactual images, domain adaptation, robustness assessment, out-of-distribution performance
Authors
Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta, Mélanie Roschewitz, Ben Glocker
Abstract
Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.
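The evaluation loop the abstract describes can be illustrated with a minimal sketch: generate a counterfactual version of each image under a targeted intervention, then compare classifier performance on factual vs counterfactual sets, using the gap as a proxy for the out-of-distribution drop. Everything below is hypothetical and not from the paper: `counterfactual` is a stand-in intensity shift rather than a learned causal generative model, and `classifier` is a toy thresholding model.

```python
import numpy as np

rng = np.random.default_rng(0)

def counterfactual(images, intervention):
    """Stand-in for a causal generative model: apply an intervention
    (here, a per-'scanner' intensity shift) while leaving the underlying
    content untouched. A real system would use a learned model."""
    shift = {"scanner_A": 0.0, "scanner_B": 0.15}[intervention]
    return np.clip(images + shift, 0.0, 1.0)

def classifier(images, threshold=0.5):
    """Toy classifier: predicts positive if mean intensity exceeds a
    threshold. Stands in for a trained deep model."""
    return (images.mean(axis=(1, 2)) > threshold).astype(int)

def stress_test(images, labels, interventions):
    """Compare accuracy on factual images vs counterfactuals under each
    intervention; the gap estimates the robustness drop per shift."""
    base_acc = (classifier(images) == labels).mean()
    report = {}
    for iv in interventions:
        cf_acc = (classifier(counterfactual(images, iv)) == labels).mean()
        report[iv] = {"accuracy": cf_acc, "gap": base_acc - cf_acc}
    return base_acc, report

# Synthetic 'images': positives slightly brighter than negatives.
labels = rng.integers(0, 2, size=200)
images = np.clip(
    rng.normal(0.45 + 0.1 * labels[:, None, None], 0.05, size=(200, 8, 8)),
    0.0, 1.0,
)

base, report = stress_test(images, labels, ["scanner_A", "scanner_B"])
print(f"factual accuracy: {base:.2f}")
for iv, r in report.items():
    print(f"{iv}: accuracy={r['accuracy']:.2f}, gap={r['gap']:+.2f}")
```

In this toy setup the null intervention (`scanner_A`) leaves accuracy unchanged, while the intensity shift (`scanner_B`) pushes all images past the decision threshold and produces a large accuracy gap, mirroring how a targeted counterfactual can expose a failure mode that the factual validation set cannot.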