DiffusionBench: On Holistic Evaluation of Diffusion Transformers

2026-06-23

Computer Vision and Pattern Recognition
AI summary

The authors point out that most research on diffusion transformers (DiTs) for image generation focuses only on generating images based on ImageNet class labels. They created NanoGen, a flexible system that makes it easy to train both ImageNet and text-to-image (T2I) DiT models using similar computing power. Their tests show that improving performance on ImageNet does not guarantee improvements for text-to-image tasks, suggesting both should be evaluated. To help with this, the authors propose DiffusionBench, a new benchmark combining both evaluation setups to better measure progress in DiT research.

Diffusion Transformer (DiT), ImageNet, Text-to-Image Generation, NanoGen, Latent Diffusion Models, FID (Fréchet Inception Distance), Class-Conditional Generation, Variational Autoencoder (VAE), Benchmarking, Pearson Correlation
Authors
Xingjian Leng, Jaskirat Singh, Zhanhao Liang, Ethan Smith, Martin Bell, Aninda Saha, Yuhui Yuan, Liang Zheng
Abstract
Diffusion transformer (DiT) research on image generation has converged on a single evaluation setup: class-conditional generation on ImageNet. While methods keep improving FID and related metrics, it is increasingly unclear whether these gains reflect real progress in generative modeling. The natural alternative, text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate, and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both ImageNet and T2I setups. Under NanoGen, training a T2I model requires compute comparable to ImageNet training. After training 21 latent diffusion models with NanoGen, we observe that method rankings show no strong correlation between ImageNet and T2I generation: the Pearson correlation ranges from -0.377 to -0.580 across three metrics. This suggests that a method which improves class-conditional ImageNet FID may show no corresponding improvement on T2I, which indicates that DiTs should be evaluated on both tasks. To this end, we summarize ImageNet and T2I results into DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve DiffusionBench are more likely to reflect broader progress.
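
As an illustration of the correlation analysis mentioned in the abstract, the following is a minimal sketch of a Pearson correlation between per-method scores on two tasks. It assumes the correlation is computed over one score per method on each task; the abstract does not specify the exact procedure. The function name `pearson` and the variables `imagenet_scores` and `t2i_scores` are hypothetical, and the numbers are placeholders chosen for easy arithmetic, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for four hypothetical methods (NOT from the paper).
# Each position is the same method evaluated on the two tasks.
imagenet_scores = [2.0, 3.0, 4.0, 5.0]
t2i_scores = [9.0, 7.0, 8.0, 6.0]

r = pearson(imagenet_scores, t2i_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r near +1 would mean that methods scoring better on one task also tend to score better on the other; values near zero or below, like the range reported in the abstract, mean that an improvement on one task does not predict an improvement on the other.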