Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers

2026-06-30 • Computer Vision and Pattern Recognition

AI summary

The authors address a mismatch between modern high-capacity diffusion models (Teachers) and compact models (Students): the two operate in different latent spaces, with different latent resolutions and VAEs, which blocks standard distillation. They formalize this setting as Cross-Space Distillation and propose the Bridge, a lightweight module that maps Student latents into the Teacher's latent space without modifying the Student backbone. With the Bridge, a one-step SD 1.5 Student improves from 5.4 to 9.4 HPSv3 while keeping one-step inference, low latency, and ecosystem compatibility. The results indicate that compact models can be distilled from large Teachers even when their latent spaces differ.

diffusion models, latent space, timestep distillation, variational autoencoder (VAE), knowledge distillation, cross-space distillation, latent resolution, model compression, latent projection, inference latency

Authors

Anh Nguyen, Ngan Nguyen, Duc Vu, Trung Dao, Viet Nguyen, Quan Dao, Kien Nguyen, Chi Tran, Phong Nguyen, Khoi Nguyen, Cuong Pham, Dimitris Metaxas, Vishal M. Patel, Anh Tran

Abstract

Modern one-step diffusion models achieve impressive quality through distribution-based timestep distillation. Yet, they rely on a critical assumption: Teacher and Student must inhabit the same latent space. This Shared-Space constraint prevents knowledge transfer from modern high-capacity Teachers (e.g., SD 3.5 and Flux) into compact, deployment-friendly Students such as SD 1.5, whose latent resolution and VAE parameterization differ from the Teacher's. We formalize this overlooked regime as Cross-Space Distillation, where Teacher and Student differ in both latent resolution and VAE space. To enable distillation under this mismatch, we introduce the Bridge, a lightweight latent interface that maps Student latents into the Teacher space without modifying the Student backbone. The Bridge combines a frozen Student VAE decoder as a spatial prior with a compact learnable projector, and is trained with latent reconstruction and attention fidelity objectives for stable Teacher-space alignment. Across diverse modern Teachers, the Bridge enables substantial gains for compact one-step Students; for example, it improves SD 1.5 from 5.4 to 9.4 HPSv3 while preserving one-step inference, low latency, and broad ecosystem compatibility. These results show that heterogeneous large Teachers can be distilled into efficient, deployable backbones through a lightweight latent-space interface.
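To make the Cross-Space mismatch concrete, the following is a minimal shape-bookkeeping sketch, not the authors' code. The numbers are assumptions that the abstract does not state: a Student working at 512 px with a 4-channel, 8x-downsampling VAE (a configuration commonly associated with SD 1.5) and a Teacher working at 1024 px with a 16-channel, 8x-downsampling VAE. The names `LatentSpec` and `describe_gap` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentSpec:
    """Shape description of a VAE latent space (all values are assumptions)."""
    name: str
    image_size: int  # pixel resolution the model operates at
    downsample: int  # spatial downsampling factor of the VAE encoder
    channels: int    # number of latent channels

    @property
    def shape(self) -> tuple:
        """Latent tensor shape as (channels, height, width)."""
        side = self.image_size // self.downsample
        return (self.channels, side, side)


def describe_gap(student: LatentSpec, teacher: LatentSpec) -> dict:
    """Summarize what a Student-to-Teacher latent mapping has to change."""
    s_channels, s_side, _ = student.shape
    t_channels, t_side, _ = teacher.shape
    return {
        "student_shape": student.shape,
        "teacher_shape": teacher.shape,
        "spatial_scale": t_side / s_side,          # upsampling factor needed
        "channel_change": (s_channels, t_channels),
        "same_shape": student.shape == teacher.shape,
    }


# Assumed, illustrative configurations (not taken from the abstract).
student = LatentSpec("Student (SD 1.5-like)", image_size=512, downsample=8, channels=4)
teacher = LatentSpec("Teacher (SD 3.5-like)", image_size=1024, downsample=8, channels=16)

gap = describe_gap(student, teacher)
for key, value in gap.items():
    print(f"{key}: {value}")
```

Under these assumed settings, a Student latent would need to be upsampled spatially and re-expressed in a different channel layout before a Teacher could score it. Matching shapes alone would still not guarantee a shared space, since two VAEs with identical latent shapes can encode images differently, which is why the abstract describes a learned projector rather than a fixed resize.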
