MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

2026-03-26 • Computer Vision and Pattern Recognition

AI summary

The authors discuss how Vision Foundation Models (VFMs) are typically used with images at just one scale during inference, even though different image sizes can provide different useful information. They propose a method called Multi-Resolution Fusion (MuRF), which processes the same image at multiple resolutions and combines the features to get a better overall understanding. This technique works without retraining models and can be applied broadly to various types of VFMs. The authors tested MuRF on different models and tasks, showing it improves visual representations by capturing both global and fine details.

Vision Foundation Models, multi-resolution, inference, feature fusion, DINOv2, contrastive learning, image representation, semantic recognition, fine-grained details, computer vision

Authors

Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee

Abstract

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families: primarily DINOv2, with successful generalization to contrastive models such as SigLIP2.
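
The abstract describes the pipeline only at a high level: encode the same image at several resolutions with a frozen model, then fuse the features. The sketch below illustrates that flow in dependency-free Python. It is not the authors' implementation. The abstract does not state the fusion operator or the resolutions used, so the sketch assumes nearest-neighbour alignment to the finest feature grid followed by channel concatenation. `toy_encoder`, `resize_nearest`, `murf_like_features`, and `PATCH` are hypothetical names, and `toy_encoder` is only a stand-in for a real frozen VFM such as DINOv2.

```python
# Toy sketch of multi-resolution feature fusion (pure Python, no dependencies).
# Everything here is illustrative: toy_encoder stands in for a frozen VFM, and
# align-then-concatenate is an ASSUMED fusion rule, not taken from the paper.

PATCH = 2  # hypothetical patch size of the stand-in encoder


def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2D grid (a list of rows)."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def toy_encoder(image):
    """Stand-in for a frozen encoder: one [mean, range] feature per patch."""
    h, w = len(image), len(image[0])
    features = []
    for r in range(0, h - PATCH + 1, PATCH):
        row = []
        for c in range(0, w - PATCH + 1, PATCH):
            pixels = [image[r + i][c + j] for i in range(PATCH) for j in range(PATCH)]
            row.append([sum(pixels) / len(pixels), max(pixels) - min(pixels)])
        features.append(row)
    return features


def murf_like_features(image, sizes):
    """Encode the image at each size, then fuse on the finest feature grid."""
    maps = [toy_encoder(resize_nearest(image, s, s)) for s in sizes]
    grid_h = max(len(m) for m in maps)
    grid_w = max(len(m[0]) for m in maps)
    # Bring every per-scale feature map onto the same spatial grid.
    aligned = [resize_nearest(m, grid_h, grid_w) for m in maps]
    # Concatenate the per-scale feature vectors at every grid location.
    return [
        [sum((m[r][c] for m in aligned), []) for c in range(grid_w)]
        for r in range(grid_h)
    ]


image = [[float((r * 8 + c) % 7) for c in range(8)] for r in range(8)]
fused = murf_like_features(image, sizes=[4, 8])
print(len(fused), len(fused[0]), len(fused[0][0]))  # grid height, width, channels
```

The encoder is called once per resolution and is never updated, which mirrors the training-free property the abstract emphasizes. With a real backbone, the low-resolution pass would supply the coarse, global features and the high-resolution pass the fine-grained ones. Whether concatenation, averaging, or another operator is the better way to fuse them is left open here.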
