Adaptation to Intrinsic Dependence in Diffusion Language Models
2026-02-23 • Machine Learning
Machine Learning · Information Theory
AI summary
The authors study diffusion language models (DLMs), which generate text by gradually revealing tokens in parallel rather than one-by-one like traditional methods. They introduce a new way to decide how many tokens to reveal at each step, using randomness instead of a fixed schedule, without needing prior knowledge about the data. Their analysis shows that this approach adapts well to the underlying structure of the data, improving theoretical guarantees for how quickly the generated text matches the true data distribution. This means their method can speed up the sampling process, especially when the data is simpler or has fewer dependencies.
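The summary above describes replacing a fixed unmasking schedule with a randomized one: at each of K iterations, the number of tokens revealed is drawn at random rather than fixed in advance. A minimal illustrative sketch of that idea follows; the uniform per-position step assignment and the function name are assumptions for illustration, not the paper's specific parameter choices.

```python
import random

def randomized_unmasking_schedule(L, K, seed=0):
    """Hypothetical sketch: partition L masked token positions into K
    reveal steps, randomizing how many tokens each step unmasks rather
    than fixing the step size at L/K. The uniform assignment of each
    position to a step is an illustrative choice only."""
    rng = random.Random(seed)
    positions = list(range(L))
    rng.shuffle(positions)            # random reveal order
    # Each position is independently assigned a uniformly random step,
    # so per-step unmasking sizes are random (multinomial) rather than fixed.
    sizes = [0] * K
    for _ in range(L):
        sizes[rng.randrange(K)] += 1
    schedule, start = [], 0
    for s in sizes:
        schedule.append(positions[start:start + s])
        start += s
    return schedule  # K (possibly empty) sets of positions to unmask

steps = randomized_unmasking_schedule(L=16, K=4)
assert sum(len(s) for s in steps) == 16  # every position revealed exactly once
```

A sampler would then, at step k, query the model on the currently revealed tokens and fill in the positions in `steps[k]` in parallel.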
Diffusion language models · Autoregressive models · Unmasking schedule · Token generation · Parallel sampling · Kullback-Leibler divergence · Total correlation · Dual total correlation · Sampling convergence · Intrinsic data dependence
Authors
Yunxiao Zhao, Changxiao Cai
Abstract
Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) approaches, enabling parallel token generation beyond a rigid left-to-right order. Despite growing empirical success, the theoretical understanding of how unmasking schedules -- which specify the order and number of tokens unmasked at each sampling step -- affect generation quality remains limited. In this work, we introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution, without requiring any prior knowledge or hyperparameter tuning. In contrast to prior deterministic procedures that fix unmasking sizes, our method randomizes the number of tokens revealed at each iteration. We show that, for two specific parameter choices, the sampling convergence guarantees -- measured by Kullback-Leibler (KL) divergence -- scale as $\widetilde O(\mathsf{TC}/K)$ and $\widetilde O(\mathsf{DTC}/K)$ respectively. Here, $K$ is the number of iterations, and $\mathsf{TC}$ and $\mathsf{DTC}$ are the total correlation and dual total correlation of the target distribution, capturing the intrinsic dependence structure underlying the data. Importantly, our guarantees hold in the practically relevant parallel-sampling regime $K<L$, where $L$ is the token sequence length. These results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions. Overall, our findings unveil the adaptivity of DLMs to intrinsic data structures and shed light on the benefit of randomized unmasking sizes in inference schedule design.
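The bounds above are stated in terms of the total correlation $\mathsf{TC}(X) = \sum_i H(X_i) - H(X_1,\dots,X_L)$ and the dual total correlation $\mathsf{DTC}(X) = \sum_i H(X_{-i}) - (L-1)\,H(X_1,\dots,X_L)$, both of which vanish for independent tokens. A small self-contained sketch of computing these quantities for a toy joint distribution (the helper names and the example distribution are illustrative, not from the paper):

```python
import itertools
import math

def entropy(pmf):
    """Shannon entropy (in nats) of a dict {outcome: probability}."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(joint, idxs):
    """Marginal pmf over the variables at the given positions."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out

def tc_dtc(joint, L):
    """Total correlation and dual total correlation via their
    entropy-based definitions:
      TC  = sum_i H(X_i)    - H(X_1..X_L)
      DTC = sum_i H(X_{-i}) - (L - 1) * H(X_1..X_L)
    """
    H_joint = entropy(joint)
    tc = sum(entropy(marginal(joint, [i])) for i in range(L)) - H_joint
    dtc = sum(entropy(marginal(joint, [j for j in range(L) if j != i]))
              for i in range(L)) - (L - 1) * H_joint
    return tc, dtc

# Toy example: 3 binary tokens with X1 = X2 (fully dependent), X3 independent.
joint = {}
for x1, x3 in itertools.product([0, 1], repeat=2):
    joint[(x1, x1, x3)] = 0.25
tc, dtc = tc_dtc(joint, 3)
print(tc, dtc)  # both equal log 2 ≈ 0.693 here
```

For a fully independent distribution both quantities are zero, which is the "low-complexity" regime where the paper's $\widetilde O(\mathsf{TC}/K)$ and $\widetilde O(\mathsf{DTC}/K)$ guarantees predict the largest sampling acceleration.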