Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

2026-03-06
Computer Vision and Pattern Recognition

AI summary

The authors explain that current methods for teaching AI to create images, videos, or sounds rely on separately trained helper models, which are cumbersome to maintain and not always aligned with the generation task. They propose Self-Flow, a new approach that lets the AI learn to understand meaning directly while it learns to create content, without any outside supervision. Their method applies different noise levels to different parts of the data, encouraging the AI to infer the missing details, which builds stronger understanding. The approach works across many types of data and improves the quality of generated images, video, and audio.

diffusion models, flow models, semantic representations, self-supervised learning, flow matching, noise scheduling, multi-modal training, generative models, denoising
Authors
Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, Robin Rombach
Abstract
Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives the model to learn strong representations alongside its generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
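To make the core idea concrete, the following is a minimal sketch of how heterogeneous per-token noise levels could be applied in a flow-matching setup. The function name `dual_timestep_corrupt`, the linear interpolation convention, and the random token partition are all assumptions for illustration; the paper's actual scheduling rule is not specified here.

```python
import torch

def dual_timestep_corrupt(x, noise, t_a, t_b, mask):
    """Hypothetical sketch: corrupt two disjoint token groups of a
    sequence x (batch, tokens, dim) at two different timesteps.

    Tokens where mask is True use level t_a; the rest use t_b.
    The linear interpolant x_t = (1 - t) * x + t * noise follows a
    common rectified-flow convention (an assumption, not the paper's
    exact formulation).
    """
    t = torch.where(mask, t_a, t_b)   # per-token timestep, (batch, tokens)
    t = t.unsqueeze(-1)               # broadcast over the feature dim
    return (1.0 - t) * x + t * noise

# Usage: one token group stays nearly clean while the other is heavily
# noised, creating the information asymmetry described in the abstract.
batch, tokens, dim = 2, 8, 16
x = torch.randn(batch, tokens, dim)
noise = torch.randn_like(x)
mask = torch.rand(batch, tokens) < 0.5            # random token partition
t_a = torch.full((batch, tokens), 0.1)            # lightly corrupted group
t_b = torch.full((batch, tokens), 0.9)            # heavily corrupted group
x_t = dual_timestep_corrupt(x, noise, t_a, t_b, mask)
```

Under this reading, the model must reconstruct the heavily noised tokens largely from the information carried by the lightly noised ones, which is what pushes it toward semantic representations rather than pure denoising.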