Scaling Beyond Masked Diffusion Language Models

2026-02-16

Machine Learning, Computation and Language
AI summary

The authors studied different ways of using diffusion models to generate language, focusing on how these methods scale with model size. They found that while Masked diffusion models are popular because they score well on standard likelihood benchmarks, other diffusion methods can be faster and more practical to sample from despite worse scores. They made Masked diffusion training more efficient with a simpler cross-entropy objective and showed that larger models using uniform-state diffusion performed well on difficult reasoning tasks, even though their perplexity was worse. This suggests that speed and practicality can matter as much as traditional metrics like perplexity when comparing these model families.

Keywords

diffusion language models, masked diffusion, discrete diffusion, autoregressive models, perplexity, scaling laws, cross-entropy loss, likelihood, GSM8K benchmark, speed-quality Pareto frontier
Authors
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, Ante Jukic
Abstract
Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms
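The abstract's efficiency claim rests on replacing the reweighted diffusion (ELBO-style) objective with plain cross-entropy on masked tokens. The sketch below illustrates that idea in a toy setting; it is not the authors' released implementation, and the denoiser, vocabulary size, and mask id are assumptions made for the example.

```python
# Minimal sketch of a masked-diffusion training step with a plain
# cross-entropy objective. The model, vocabulary size, and mask id are
# toy placeholders, not the paper's actual architecture or hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1000          # assumed toy vocabulary
MASK_ID = VOCAB_SIZE       # extra [MASK] token appended to the vocabulary


class ToyDenoiser(nn.Module):
    """Stand-in for a transformer denoiser: embeds tokens, predicts logits."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, d_model)  # +1 for [MASK]
        self.proj = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(x))  # (batch, seq, vocab) logits


def masked_ce_loss(model: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on masked positions only.

    A standard masked-diffusion ELBO reweights this term by a function of
    the masking rate t; this sketch drops that weight to illustrate the
    plain cross-entropy variant the abstract refers to.
    """
    batch, seq = x0.shape
    # Sample a masking rate per sequence (clamped so the toy demo always
    # masks a few tokens), then mask tokens i.i.d. at that rate.
    t = torch.rand(batch, 1).clamp(min=0.05)
    mask = torch.rand(batch, seq) < t
    xt = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    logits = model(xt)
    return F.cross_entropy(
        logits[mask],  # predictions at masked positions
        x0[mask],      # clean tokens the model should reconstruct
    )


if __name__ == "__main__":
    model = ToyDenoiser()
    x0 = torch.randint(0, VOCAB_SIZE, (4, 32))  # toy batch of clean tokens
    print(masked_ce_loss(model, x0).item())
```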