AI summary
The authors found that large language models (LLMs), which are usually trained to predict one word at a time, can actually predict multiple future words at once without extra training. They developed a simple method that uses special mask tokens to guess several next words in parallel, speeding up text generation without changing the model itself. Their approach works by building a tree of possible next words and pruning less likely options to keep predictions efficient and accurate. Tests show this method improves throughput and the average accepted prediction length on popular models like LLaMA3 and Qwen3. The authors also explain why these models can do this naturally, thanks to the way their layers handle mask tokens and next-word predictions.
Keywords
Large Language Models, Multi-token Prediction, Mask Tokens, Speculative Decoding, Token Throughput, LLaMA3, Qwen3, Next-token Prediction, Parallel Decoding, Model Pruning
Authors
Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott
Abstract
Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8–12% on Qwen3, and achieving throughput gains of up to 15–19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.
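To make the abstract's pipeline concrete, here is a minimal, hedged sketch (not the authors' implementation) of two pieces it describes: keeping top-K high-probability candidates from the mask-token logits with lightweight pruning, and verifying drafted tokens in parallel against the target model's own greedy predictions so that generation stays lossless. The function names and the `min_prob` threshold are illustrative assumptions, not details from the paper.

```python
import math

def top_k_pruned(logits, k, min_prob=0.05):
    """Softmax the mask-token logits over the vocabulary, keep the top-k
    candidate tokens, and prune any whose probability falls below min_prob
    (a stand-in for the paper's lightweight pruning strategy)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]       # stable softmax
    z = sum(exps)
    scored = sorted(((e / z, tok) for tok, e in enumerate(exps)), reverse=True)
    return [(tok, p) for p, tok in scored[:k] if p >= min_prob]

def verify_greedy(draft, target_argmax):
    """Accept the longest prefix of drafted tokens that matches the target
    model's argmax at each position; on the first mismatch, substitute the
    target token and stop. The output is identical to what plain
    autoregressive greedy decoding would have produced (lossless)."""
    accepted = []
    for d, t in zip(draft, target_argmax):
        if d != t:
            accepted.append(t)                     # mismatch: take target token
            return accepted
        accepted.append(d)                         # draft confirmed
    if len(target_argmax) > len(draft):            # all drafts accepted:
        accepted.append(target_argmax[len(draft)]) # one bonus target token
    return accepted
```

Under this sketch, one forward pass verifies an entire drafted branch: every accepted draft token saves a model call, which is where the reported throughput gains would come from.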