Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging

2026-05-06 • Computer Vision and Pattern Recognition


AI summary

The authors developed WALDO, a new method to find unusual spots in medical images without prior training, by comparing suspicious areas to reference images of normal anatomy. Their approach uses math from optimal transport theory to carefully select and compare image patches, improving detection by focusing on references that are neither too similar nor too different. WALDO showed better accuracy in spotting brain abnormalities on MRI scans compared to other zero-shot methods, with consistent results across different models. The authors also provide their code to help others try out their approach.

Zero-shot learning, Anomaly localisation, Vision-language models, Optimal transport theory, Sliced Wasserstein distance, DINOv2, Brain MRI, Reference distributions, Bias-variance trade-off

Authors

Bernhard Kainz, Johanna P Mueller, Matthew Baugh, Cosmin Bercea

Abstract

Zero-shot anomaly localisation via vision-language models (VLMs) offers a compelling approach for rare pathology detection, yet its performance is fundamentally limited by the absence of healthy anatomical context. We reformulate zero-shot localisation as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy. We introduce WALDO, a training-free framework grounded in optimal transport theory that enables comparative reasoning through: (i) entropy-weighted Sliced Wasserstein distances for anatomically aware reference selection from DINOv2 patch distributions, (ii) Goldilocks zone sampling exploiting the non-monotonic relationship between reference similarity and localisation accuracy, and (iii) self-consistency aggregation via weighted non-maximum suppression. We theoretically analyse the Goldilocks effect through distributional divergence, and show that references with moderate similarity minimise a bias-variance trade-off in comparative visual reasoning. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves $43.5_{\pm1.6}\%$ mAP@30 (95% CI: [40.4, 46.7]), representing a 19% relative improvement over zero-shot baselines. Cross-model evaluation shows consistent gains: GPT-4o achieves $32.0_{\pm6.5}\%$ and Qwen3-VL-32B achieves $32.0_{\pm6.6}\%$ mAP@30. Paired McNemar tests confirm statistical significance ($p<0.01$). Source code is available at https://github.com/bkainz/WALDO_MICCAI26_demo .
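The two reference-selection ideas named in the abstract, Sliced Wasserstein distances over patch features and sampling references of moderate similarity, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`sliced_wasserstein`, `select_mid_band`), the quantile thresholds, and the use of equal-sized patch sets are all assumptions made for illustration, and the entropy weighting is omitted because the abstract does not specify it.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    x, y: arrays of shape (n_patches, feature_dim) with the SAME n_patches
    (an assumption of this sketch, e.g. equal-sized patch grids).
    """
    rng = np.random.default_rng(seed)
    # Draw random unit directions on the sphere in feature space.
    dirs = rng.normal(size=(n_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project both patch sets onto every direction and sort each projection.
    # For equal-sized 1D samples, optimal transport matches sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    # Average the squared 1D transport cost over patches and directions.
    return float(np.sqrt(np.mean((px - py) ** 2)))


def select_mid_band(distances, low_q=0.25, high_q=0.75):
    """Return indices of references whose distance is neither very small
    nor very large. The quantile band is a hypothetical stand-in for the
    paper's Goldilocks zone, whose exact definition is not given here."""
    distances = np.asarray(distances, dtype=float)
    lo, hi = np.quantile(distances, [low_q, high_q])
    return np.flatnonzero((distances >= lo) & (distances <= hi))
```

In use, one would compute `sliced_wasserstein(query_feats, ref_feats)` for every candidate reference image and pass the resulting list to `select_mid_band` to keep only the moderately similar references. The fixed seed makes the projection directions identical across calls, so distances to different references are directly comparable.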
