Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models
2026-06-03 • Computer Vision and Pattern Recognition
AI summary
The authors show a way to link two types of computer vision models with complementary strengths: one ties images to language concepts but captures less visual detail, and the other captures fine visual detail but has no language grounding. Their method, called GPUA, treats the detailed model's features as a new language and learns to map them into the language-based model's space without labels and without changing either model's parameters. This helps the two models work together on tasks such as image recognition and segmentation, especially in zero-shot settings where no task-specific training happens. The approach applies across many tasks and adds little computational cost.
Foundation models, Computer vision, Vision-language foundation models (VLMs), Vision-only foundation models (VFMs), Unsupervised alignment, Cross-lingual alignment, Orthogonal mapping, Zero-shot recognition, Semantic alignment, Feature space
Authors
Shuwen Yu, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li
Abstract
Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at https://github.com/Yuteam14/GPUA
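To make the idea of a geometry-preserving orthogonal mapping concrete, the sketch below fits an orthogonal matrix with the classical orthogonal Procrustes solution, a standard building block in cross-lingual embedding alignment. It is not the GPUA procedure, which the abstract does not specify. It assumes, for illustration only, that paired features of equal dimension are available, and the names `fit_orthogonal_map`, `X`, `Y`, and `Q` are hypothetical, with synthetic data standing in for real VFM and VLM features.

```python
import numpy as np


def fit_orthogonal_map(X, Y):
    """Orthogonal Procrustes: argmin_W ||X W - Y||_F subject to W^T W = I.

    X, Y: arrays of shape (n, d) holding paired source and target features.
    Assumption: pairs and equal dimensions are given; this is an illustrative
    stand-in, not the unsupervised objective used by GPUA.
    """
    # The optimum is U V^T, where U S V^T is the SVD of the cross-covariance.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic stand-ins for VFM features (X) and VLM features (Y).
rng = np.random.default_rng(0)
n, d = 200, 8
X = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # hidden orthogonal map
Y = X @ Q + 0.01 * rng.standard_normal((n, d))    # rotated features plus noise

W = fit_orthogonal_map(X, Y)
mapped = X @ W

# An orthogonal W leaves all pairwise inner products of the source unchanged.
gram_preserved = np.allclose(mapped @ mapped.T, X @ X.T)
```

Because `W` is constrained to be orthogonal, distances and angles among the source features are unchanged after mapping, which is the sense in which such a mapping preserves geometry while moving features into the target space.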