S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

2026-03-26
Computation and Language

AI summary

The authors propose S2D2, a method that speeds up text generation in block-diffusion language models without any extra training. It uses the same pretrained model in two modes: a diffusion mode that proposes many tokens in parallel, and an autoregressive mode (block size one) that checks those proposals token by token. Lightweight routing policies decide when verification is worth its cost. Across three block-diffusion model families, S2D2 improves the accuracy-speed tradeoff over confidence-thresholding baselines; on SDAR it reaches up to 4.7× speedup over autoregressive decoding.

block-diffusion language models, autoregressive decoding, parallel denoising, speculative decoding, self-speculative decoding, confidence-thresholding, dynamic decoding, sequence-level critic
Authors
Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava
Abstract
Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
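
To make the two decoding ingredients named in the abstract concrete, the following is a minimal toy sketch in Python. It is not the authors' implementation. The function names (`confident_positions`, `accept_prefix`), the fallback that commits the single most confident position, and the greedy exact-match acceptance rule are all assumptions chosen for illustration; the abstract does not specify the acceptance rule or the routing policies. Token ids and confidence values are made up.

```python
from typing import List, Sequence


def confident_positions(confidences: Sequence[float], threshold: float) -> List[int]:
    """Confidence-thresholded unmasking (illustrative).

    Returns the indices of masked positions whose confidence meets the
    threshold. If none qualify, falls back to the single most confident
    position so that decoding always makes progress (an assumption).
    """
    picked = [i for i, c in enumerate(confidences) if c >= threshold]
    if not picked and confidences:
        picked = [max(range(len(confidences)), key=lambda i: confidences[i])]
    return picked


def accept_prefix(draft: Sequence[int], verifier_tokens: Sequence[int]) -> List[int]:
    """Greedy speculative verification (assumed rule, hypothetical helper).

    `draft[i]` is the token proposed in parallel at position i.
    `verifier_tokens[i]` is the token the autoregressive (block size one)
    pass would pick at position i given the preceding drafted tokens.
    Drafted tokens are kept up to the first disagreement; at that position
    the verifier's token is taken instead and the rest of the draft is dropped.
    """
    accepted: List[int] = []
    for drafted, verified in zip(draft, verifier_tokens):
        if drafted != verified:
            accepted.append(verified)  # correction from the verifier
            break
        accepted.append(drafted)
    return accepted


if __name__ == "__main__":
    # Positions 0 and 2 clear a 0.9 threshold.
    print(confident_positions([0.95, 0.40, 0.91, 0.60], threshold=0.9))  # [0, 2]
    # The draft agrees on the first two tokens, then is corrected at the third.
    print(accept_prefix([5, 7, 9, 2], [5, 7, 4, 2]))  # [5, 7, 4]
```

In this sketch, a lower threshold commits more tokens per denoising step but risks errors, which mirrors the brittleness described above; the verification step then acts as a check on the drafted tokens, at the cost of one extra pass.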