D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

2026-06-03

Distributed, Parallel, and Cluster Computing; Machine Learning
AI summary

The authors present D^2SD, a speculative decoding method that speeds up large language model text generation by drafting several tokens at once and verifying them in a single pass of the target model. Existing diffusion-based drafters typically commit to one draft sequence, so every token after the first mismatch is discarded. D^2SD instead organizes candidates into a prefix tree guided by per-position confidence scores: it estimates where a rejection is most likely and proposes alternative continuations only from those points. According to the authors, this reduces wasted computation and increases the number of accepted tokens, outperforming both the underlying diffusion approach and autoregressive speculative decoding baselines.

Speculative Decoding, Autoregressive Models, Diffusion Models, Prefix Tree, Confidence Scoring, Cascade Attention, Token Verification, Large Language Models, Batch Processing, Inference Acceleration
Authors
Liyuan Zhang, Jiarui Zhang, Jinwei Yao, Ran Yan, Yuchen Yang, Jiahao Zhang, Tongkai Yang, Yi Wu, Binhang Yuan
Abstract
Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, which limits the acceptance rate. Naively batching more draft candidate sequences yields only marginal improvement, because redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens. We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree. The first diffusion drafter generates a block along with per-position confidence scores, which are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery. The second, variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass. The resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
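
To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows (1) why a single draft loses everything after the first mismatch under greedy verification, and (2) one plausible way to turn per-position confidence scores into top-K recovery prefixes. The abstract does not specify the scoring rule, so the sketch assumes each confidence can be read as an independent acceptance probability; the function names `accepted_length`, `first_rejection_probs`, and `select_recovery_prefixes` are hypothetical.

```python
from typing import List, Sequence


def accepted_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target model's greedy choices.

    Every token after the first mismatch is discarded, however good it is.
    """
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def first_rejection_probs(confidences: Sequence[float]) -> List[float]:
    """Estimate P(first rejection happens at position i) for each position.

    ASSUMPTION (not from the paper): confidences[i] is treated as an
    independent probability that position i is accepted.
    """
    probs = []
    survive = 1.0  # probability that all earlier positions were accepted
    for c in confidences:
        probs.append(survive * (1.0 - c))
        survive *= c
    return probs


def select_recovery_prefixes(confidences: Sequence[float], k: int) -> List[int]:
    """Return up to k prefix lengths to re-anchor at, most likely boundary first.

    A prefix length of i keeps draft tokens 0..i-1 and redrafts from position i.
    """
    probs = first_rejection_probs(confidences)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # The draft matches the target for two tokens, then diverges.
    print(accepted_length([1, 2, 3, 4], [1, 2, 9, 4]))
    # Position 2 has low confidence, so it ranks as the likeliest boundary.
    print(select_recovery_prefixes([0.9, 0.8, 0.5, 0.9], k=2))
```

Under this assumed rule, a low-confidence position that follows a run of high-confidence positions ranks highest, which matches the abstract's intuition that alternative branches should be placed where a rejection is most likely rather than spread uniformly across the block.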