SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

2026-02-26 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors study how to align vision and language models so that images and text share one embedding space, using far fewer paired examples than usual. Their method, SOTAlign, first fits a rough alignment from a small set of image-text pairs, then refines it on many unpaired images and texts using optimal transport, a mathematical framework for comparing and matching distributions. This lets the models learn shared representations even when paired data is scarce, and the authors report that it outperforms both supervised and semi-supervised baselines. The work suggests that pretrained unimodal networks can be aligned effectively in a semi-supervised way.

neural networks, vision models, language models, semi-supervised learning, paired data, unpaired data, optimal transport, contrastive loss, joint embeddings, representation alignment

Authors

Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.
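
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the linear teacher is fitted or which optimal-transport divergence is used, so the sketch assumes ridge regression for the teacher and a standard entropic (Sinkhorn) transport plan for the relational target. All function names, hyperparameters, and the synthetic embeddings are hypothetical.

```python
import numpy as np

def fit_linear_teacher(img_pairs, txt_pairs, ridge=1e-3):
    """Assumed stage 1: ridge-regression map W with img_pairs @ W ~ txt_pairs."""
    d = img_pairs.shape[1]
    gram = img_pairs.T @ img_pairs + ridge * np.eye(d)
    return np.linalg.solve(gram, img_pairs.T @ txt_pairs)

def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sinkhorn_plan(cost, eps=0.5, n_iters=500):
    """Entropic OT plan between uniform marginals (standard Sinkhorn iterations)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass on images
    b = np.full(m, 1.0 / m)          # uniform mass on texts
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows
        v = b / (K.T @ u)            # rescale columns
    return u[:, None] * K * v[None, :]

# Synthetic stand-ins for frozen encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 6))                    # hidden image-to-text map
img_paired = rng.normal(size=(32, 8))               # few paired image embeddings
txt_paired = img_paired @ W_true + 0.01 * rng.normal(size=(32, 6))

# Stage 1 (assumed form): coarse shared geometry from the few pairs.
W = fit_linear_teacher(img_paired, txt_paired)

# Stage 2 (assumed form): soft matching between unpaired images and texts,
# using cosine distance in the teacher's space as the transport cost.
img_unpaired = rng.normal(size=(20, 8))
txt_unpaired = rng.normal(size=(25, 6))
cost = 1.0 - l2_normalize(img_unpaired @ W) @ l2_normalize(txt_unpaired).T
plan = sinkhorn_plan(cost)                          # shape (20, 25), sums to 1
```

In this sketch, `plan` plays the role of a soft relational target that a trainable alignment layer could be encouraged to match on unpaired data; how SOTAlign actually defines and optimizes its divergence is described in the paper, not here.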
