Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have
2026-06-03 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors suggest a new way to adapt big, general computer vision models to specialized scientific fields without needing labeled data, which is often hard to get. Their method, called FINO, uses extra information (metadata) from the data itself to teach the model important features while ignoring irrelevant details. They tested FINO on different scientific image datasets like fluorescence microscopy and wildlife monitoring, where it did better than other existing methods, including ones that use labels. This approach keeps the model flexible and accurate without requiring expensive manual labeling.
vision foundation models, self-supervised learning, metadata guidance, domain adaptation, fluorescence microscopy, Earth observation, wildlife monitoring, medical imaging, feature representation, unsupervised learning
Authors
Elouan Gardès, Seung Eun Yi, Kartik Ahuja, Théo Moutakanni, Huy V. Vo, Piotr Bojanowski, Wolfgang M. Pernice, Loïc Landrieu, Camille Couprie
Abstract
We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly specialized, domain-specific state-of-the-art methods, while using no task labels for backbone adaptation and only lightweight probes for supervision.
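The abstract does not specify FINO's loss, so the following is only an illustrative sketch of the general idea it describes: a self-supervised term combined with one metadata term that preserves an informative factor and another that suppresses a spurious one. Every function name, the cosine-based self-supervised stand-in, the linear regression head `w`, the group-mean penalty, and the weights `alpha` and `beta` are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def ssl_loss(z1, z2):
    """Stand-in self-supervised term (assumed): 1 - mean cosine similarity of two views."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(z1 * z2, axis=1)))

def informative_loss(z, w, target):
    """Assumed 'preserve' term: MSE of a linear head w predicting continuous metadata."""
    return float(np.mean((z @ w - target) ** 2))

def spurious_penalty(z, group_ids):
    """Assumed 'suppress' term: mean squared distance of each group's mean
    embedding from the global mean, for a discrete spurious factor."""
    global_mean = z.mean(axis=0)
    dists = [
        np.sum((z[group_ids == g].mean(axis=0) - global_mean) ** 2)
        for g in np.unique(group_ids)
    ]
    return float(np.mean(dists))

def combined_loss(z1, z2, w, target, group_ids, alpha=1.0, beta=1.0):
    """Weighted sum of the three terms; alpha and beta are hypothetical weights."""
    return (
        ssl_loss(z1, z2)
        + alpha * informative_loss(z1, w, target)
        + beta * spurious_penalty(z1, group_ids)
    )

# Toy usage with random embeddings standing in for backbone outputs.
rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 4))                 # view 1 embeddings
z_b = z_a + 0.1 * rng.normal(size=(8, 4))     # view 2 embeddings
head = rng.normal(size=4)                     # hypothetical linear head
continuous_meta = rng.normal(size=8)          # e.g. an acquisition parameter
discrete_meta = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # e.g. a batch or site ID
total = combined_loss(z_a, z_b, head, continuous_meta, discrete_meta)
```

In this sketch, which metadata field counts as informative and which as spurious is chosen by the user; in practice the terms would be computed on backbone outputs and minimized with a gradient-based optimizer rather than evaluated on fixed arrays.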