Normalizing Trajectory Models
2026-05-08 • Computer Vision and Pattern Recognition • Machine Learning
AI summary
The authors present Normalizing Trajectory Models (NTM), a new way to speed up diffusion-based image generation by using fewer, larger steps. Unlike past few-step methods, which sacrifice mathematical guarantees, NTM keeps exact likelihood calculations by using invertible transformations within each step. This setup lets the model train end to end and even distill itself into a smaller denoiser for fast sampling. In tests on text-to-image tasks, their method matches or beats current models while needing only four steps.
diffusion models, normalizing flows, likelihood, denoising, image generation, score matching, self-distillation, invertible networks, flow-matching, text-to-image synthesis
Authors
Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind
Abstract
Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
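The exact-likelihood property the abstract appeals to comes from the standard change-of-variables formula for invertible maps. As a minimal illustration (not the paper's architecture), the sketch below builds one affine coupling layer — a common shallow invertible block in normalizing flows — and shows how its log-determinant yields an exact log-likelihood under a Gaussian base; the linear "networks" `W_s` and `W_t` are hypothetical stand-ins for learned scale/shift predictors.

```python
import numpy as np

# Illustrative sketch only: a single affine coupling step with exact
# log-determinant, showing how a conditional normalizing flow obtains
# exact likelihoods via the change-of-variables formula. W_s and W_t
# are hypothetical stand-ins for learned scale/shift networks.

rng = np.random.default_rng(0)
D = 4  # toy dimensionality; the first half conditions the second half

W_s = 0.1 * rng.standard_normal((D // 2, D // 2))  # log-scale map
W_t = 0.1 * rng.standard_normal((D // 2, D // 2))  # shift map

def coupling_forward(x):
    """Invertible map x -> y; returns y and log|det dy/dx|."""
    x1, x2 = x[: D // 2], x[D // 2 :]
    s = np.tanh(x1 @ W_s)           # bounded log-scale, for stability
    t = x1 @ W_t
    y2 = x2 * np.exp(s) + t         # affine transform of the second half
    return np.concatenate([x1, y2]), s.sum()

def coupling_inverse(y):
    """Exact inverse: recover x from y in closed form."""
    y1, y2 = y[: D // 2], y[D // 2 :]
    s = np.tanh(y1 @ W_s)
    t = y1 @ W_t
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

def log_prob(y):
    """Exact log p(y): Gaussian base density minus the forward log-det."""
    x = coupling_inverse(y)
    s = np.tanh(x[: D // 2] @ W_s)
    log_base = -0.5 * (x @ x) - 0.5 * D * np.log(2 * np.pi)
    return log_base - s.sum()

x = rng.standard_normal(D)
y, logdet = coupling_forward(x)
assert np.allclose(x, coupling_inverse(y))  # invertibility is exact
```

Stacking such blocks gives a deeper invertible map whose log-likelihood is the base log-density minus the sum of per-block log-determinants; per the abstract, NTM uses shallow invertible blocks of this general kind inside each reverse step rather than one monolithic flow.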