Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images

2026-04-09 | Computer Vision and Pattern Recognition

AI summary

The authors address the difficulty of analyzing ultrasound images of adnexal masses, which often suffer from inconsistent readings by different doctors. They developed a new computer model that needs fewer labeled examples by using a powerful pretrained vision transformer combined with a method to capture both broad and detailed image features. Tested on a large set of patient images, their approach outperformed traditional methods, especially in accurately outlining the mass boundaries. Additionally, their model works well even with limited training data, showing promise for use in real clinical settings where labeled data is scarce.

Adnexal mass, Ultrasound imaging, Image segmentation, Vision transformer, Self-supervised learning, DINOv3, Dice score, Hausdorff distance, Data efficiency, Dense Prediction Transformer
Authors
Francesca Fati, Alberto Rota, Adriana V. Gregory, Anna Catozzo, Maria C. Giuliano, Mrinal Dhar, Luigi De Vitis, Annie T. Packard, Francesco Multinu, Elena De Momi, Carrie L. Langstraat, Timothy L. Kline
Abstract
Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
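The two metrics reported in the abstract, the Dice score and the 95th-percentile Hausdorff Distance (HD95), can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code. The function names, the 4-connected boundary extraction, the nearest-rank percentile, and the choice to take the larger of the two directed 95th percentiles are all assumptions made here for illustration. Distances are in pixels.

```python
import math


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as nested lists of 0/1."""
    inter = total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p * t
            total += p + t
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def boundary_pixels(mask):
    """Foreground pixels with a 4-neighbour that is background or outside the image."""
    h, w = len(mask), len(mask[0])
    points = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if any(
                not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]
                for rr, cc in neighbours
            ):
                points.append((r, c))
    return points


def percentile(values, q):
    """Nearest-rank percentile (assumption; other interpolation rules exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def hd95(pred, target):
    """Larger of the two directed 95th-percentile boundary distances, in pixels."""
    a, b = boundary_pixels(pred), boundary_pixels(target)
    if not a or not b:
        # Convention (assumption): 0 if both masks are empty, infinite if only one is.
        return 0.0 if not a and not b else float("inf")

    def directed(src, dst):
        # Brute-force nearest-neighbour distance from each source point to dst.
        return [min(math.dist(p, q) for q in dst) for p in src]

    return max(percentile(directed(a, b), 95), percentile(directed(b, a), 95))


def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - ours) / baseline


# Toy example: a 2x2 square and the same square shifted one pixel to the right.
target = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(dice_score(pred, target))  # 0.5
print(hd95(pred, target))        # 1.0
```

The brute-force nearest-neighbour search is quadratic in the number of boundary pixels, which is fine for a toy mask but slow for full-resolution frames; a distance-transform-based implementation would be the usual choice there. `relative_reduction` shows how a statement such as "reducing HD95 by 11.4% relative to the strongest baseline" is computed from two HD95 values; the underlying absolute values are not given in the abstract and are not assumed here.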