Self-Augmenting Retrieval for Diffusion Language Models
2026-06-04 • Computation and Language
Computation and Language, Artificial Intelligence, Machine Learning
AI summary
The authors study a way that some AI language models generate text by guessing many words at once and keeping only the confident ones. They find that the unsure guesses can still give useful hints about important topics early on. Using this insight, they propose a method called SARDI that uses these hints to fetch helpful information while the model is still deciding what to say. This approach works without extra training, can use any retrieval method, and improves performance on complex question-answering tasks while being faster than existing methods.
discrete diffusion language models, denoising, retrieval-augmented generation, self-augmenting retrieval, SARDI, multi-hop question answering, retriever-agnostic, parallel generation, low-confidence tokens, training-free methods
Authors
Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go, Kilian Q. Weinberger
Abstract
Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to $8\times$ higher throughput.
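The lookahead idea can be illustrated with a toy sketch. This is not the authors' implementation: the `Prediction` type, the 0.5 confidence threshold, the word-overlap retriever, and the example data are all assumptions made for illustration. It shows only how low-confidence tokens that would normally be discarded at a denoising step can still change which document a retrieval query surfaces.

```python
import re
from dataclasses import dataclass


@dataclass
class Prediction:
    """One tentative token from a single denoising step (hypothetical type)."""
    position: int
    token: str
    confidence: float


def split_predictions(preds, threshold):
    """Split predictions into committed (confident) and lookahead (discarded) tokens."""
    committed = [p for p in preds if p.confidence >= threshold]
    lookahead = [p for p in preds if p.confidence < threshold]
    return committed, lookahead


def build_query(question, preds):
    """Append the given tentative tokens, in position order, to the question."""
    draft = " ".join(p.token for p in sorted(preds, key=lambda p: p.position))
    return f"{question} {draft}".strip()


def words(text):
    """Lowercased set of word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, k=1):
    """Toy stand-in for any retriever: rank documents by word overlap with the query."""
    q = words(query)
    ranked = sorted(corpus, key=lambda doc: len(q & words(doc)), reverse=True)
    return ranked[:k]


# Illustrative data (assumed, not from the paper).
docs = [
    "Inception is a 2010 science fiction film.",
    "Christopher Nolan is from London.",
    "The Eiffel Tower is in Paris.",
]
question = "Where was the director of Inception born?"
preds = [
    Prediction(0, "The", 0.90),
    Prediction(1, "director", 0.85),
    Prediction(2, "is", 0.80),
    Prediction(3, "Christopher", 0.30),  # low confidence: would be discarded
    Prediction(4, "Nolan", 0.25),        # low confidence: would be discarded
]

committed, lookahead = split_predictions(preds, threshold=0.5)
top_committed_only = retrieve(build_query(question, committed), docs)[0]
top_with_lookahead = retrieve(build_query(question, preds), docs)[0]
```

In this toy setting, a query built from the question and the committed tokens alone ranks the film document first, while adding the two low-confidence entity tokens moves the document about the director to the top. A real system would replace `retrieve` with an actual retriever and would repeat this inside the denoising loop, feeding the retrieved evidence back into later steps.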