Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
2026-03-05 • Computer Vision and Pattern Recognition
AI summary
The authors identify a speed problem in how Diffusion Language Models (DLMs) decide which tokens to finalize during text generation. They propose a method called Longest Stable Prefix (LSP) that commits a long, contiguous block of confident tokens from the start of the sequence all at once, instead of scattered tokens at disjoint positions. This keeps the Key-Value cache memory-efficient and reduces repeated corrections, making inference faster without hurting quality. Tests show LSP speeds up generation by more than three times on tasks such as math and coding.
Diffusion Language Models · Decoding Scheduler · Key-Value Cache · Longest Stable Prefix · Token Stability · Denoising Step · Bidirectional Lookahead · Inference Speed · Text Generation · Model-Agnostic
Authors
Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan
Abstract
Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on "scattered acceptance": committing high-confidence tokens at disjoint positions throughout the sequence. This inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV-cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks, including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing, while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
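The per-step commitment logic described in the abstract (find the longest left-aligned run of stable predictions, then snap the boundary back to a delimiter before committing) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the delimiter set, and the confidence threshold `tau` are all assumptions, and real LSP operates on model logits across denoising steps rather than a flat confidence list.

```python
# Illustrative sketch of one Longest Stable Prefix (LSP) commitment step.
# Assumptions (not from the paper's code): stability is approximated by a
# per-token confidence score, and "natural delimiters" by a small character set.

DELIMITERS = {".", ",", ";", ":", "\n", " "}  # assumed delimiter set

def longest_stable_prefix(tokens, confidences, tau=0.9):
    """Return how many leading tokens of the active suffix to commit.

    tokens:      decoded token strings for the active (uncommitted) suffix
    confidences: per-token stability scores from one denoiser forward pass
    tau:         assumed stability threshold
    """
    # 1. Longest contiguous, left-aligned run of stable predictions.
    k = 0
    while k < len(tokens) and confidences[k] >= tau:
        k += 1
    if k == 0:
        return 0  # nothing stable at the left edge; commit nothing this step

    # 2. Snap the boundary back to the last delimiter inside the run, so the
    #    committed prefix ends on a linguistic/structural boundary.
    for j in range(k - 1, -1, -1):
        if any(d in tokens[j] for d in DELIMITERS):
            return j + 1
    return k  # no delimiter in the run: commit the whole stable block


# The committed prefix becomes a contiguous KV-cache append; the denoiser
# then runs only on the shrinking active suffix tokens[n:].
n = longest_stable_prefix(
    ["The", " cat", " sat", ".", " on"],
    [0.99, 0.97, 0.95, 0.93, 0.40],
)
```

In this toy call, the first four tokens clear the threshold and the boundary already sits on the "." delimiter, so all four are committed atomically while the low-confidence tail stays in the active suffix for further denoising.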