Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge

2026-02-18 · Computer Vision and Pattern Recognition

AI summary

The authors propose a new method called Self-Supervised Semantic Bridge (SSB) to improve unpaired image-to-image translation, especially for medical images. Their approach uses self-supervised visual encoders to learn features that keep the structure of images while ignoring appearance differences, helping the model translate images more accurately without needing paired training examples. This method addresses problems in previous techniques by better preserving spatial details and works well even on new, unseen data. They also show that SSB can be used for editing images based on text instructions with high quality.

Keywords
adversarial diffusion, diffusion inversion, image-to-image translation, self-supervised learning, semantic priors, latent space, unpaired translation, medical image synthesis, visual encoders, text-guided image editing
Authors
Jiaming Liu, Felix Petersen, Yunhe Gao, Yabin Zhang, Hyojin Kim, Akshay S. Chaudhari, Yu Sun, Stefano Ermon, Sergios Gatidis
Abstract
Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.
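The abstract's core mechanism is to condition a diffusion bridge's denoiser on features from a self-supervised visual encoder, so that the translation is steered by geometric structure rather than source-domain appearance. The sketch below illustrates that conditioning pattern in PyTorch. It is a minimal toy, not the paper's implementation: the tiny convolutional `SemanticEncoder` stands in for a pretrained self-supervised backbone (which would be frozen in practice), the linear-interpolation bridge with fixed noise is a common simplification of a diffusion bridge, and the endpoint pairing shown here is illustrative, since in the unpaired setting the bridge endpoints come from unpaired samples of the two domain marginals. All class and function names are invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticEncoder(nn.Module):
    """Stand-in for a self-supervised visual encoder (e.g. a DINO-style
    backbone); in practice its weights would be pretrained and frozen."""

    def __init__(self, channels: int = 3, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ConditionalBridgeDenoiser(nn.Module):
    """Toy denoiser taking the noisy bridge state x_t, the time t, and the
    semantic feature map as conditioning, concatenated channel-wise."""

    def __init__(self, channels: int = 3, cond_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + cond_dim + 1, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor,
                cond: torch.Tensor) -> torch.Tensor:
        # Broadcast the scalar time t into an extra feature plane.
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, cond, t_map], dim=1))


def bridge_training_step(encoder: SemanticEncoder,
                         denoiser: ConditionalBridgeDenoiser,
                         x_src: torch.Tensor,
                         x_tgt: torch.Tensor) -> torch.Tensor:
    """One training step on a simplified noisy interpolation bridge
    between a source image x_src and a target image x_tgt."""
    b = x_src.size(0)
    t = torch.rand(b)
    tv = t.view(-1, 1, 1, 1)
    # Noisy intermediate state on the bridge between the two endpoints.
    x_t = (1 - tv) * x_src + tv * x_tgt + 0.1 * torch.randn_like(x_src)
    with torch.no_grad():
        # Structure-preserving, appearance-invariant conditioning features.
        cond = encoder(x_src)
    pred = denoiser(x_t, t, cond)
    return F.mse_loss(pred, x_tgt)
```

Because the conditioning features come from the source image rather than the target domain, the same trained denoiser can be applied to unseen source data, which is the generalization property the abstract emphasizes over adversarial approaches.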