Gumbel Distillation for Parallel Text Generation
2026-03-23 • Computation and Language
Computation and Language, Machine Learning
AI summary
The authors address the problem that parallel language models generate text faster than step-by-step autoregressive models but at lower quality. They introduce a technique called Gumbel Distillation, which uses the Gumbel-Max trick to map noise deterministically to the token choices of a strong autoregressive teacher model, so that parallel models can learn to produce better text. The approach works with different parallel models, including MDLM and BD3-LM, and in the authors' tests it improves MAUVE score by 30.0% and generative perplexity by 10.5% over MDLM trained on OpenWebText. The authors provide their code for others to use.
autoregressive language models, parallel decoding, Gumbel-Max trick, knowledge distillation, token sequences, latent noise space, MAUVE score, perplexity, MDLM, BD3-LM
Authors
Chi Zhang, Xixi Hu, Bo Liu, Qiang Liu
Abstract
The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and a 10.5% improvement in generative perplexity over MDLM trained on the OpenWebText dataset. Code is available at https://github.com/hxixixh/gumbel-distill.
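The Gumbel-Max trick that the abstract builds on can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not say how the resulting noise/token pairs enter the training objective. The example only shows the generic property: once the Gumbel noise is fixed, the sampled token is a deterministic function of the logits and the noise, and over many noise draws the tokens follow the softmax distribution. The names `sample_gumbel`, `gumbel_max`, and `teacher_logits` are hypothetical, and the logits are made-up values for a single position with a three-token vocabulary.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse-transform sampling."""
    u = rng.random()
    # rng.random() lies in [0.0, 1.0); redraw on exactly 0.0 to avoid log(0).
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(logits, noise):
    """Deterministic map from (logits, Gumbel noise) to a token index."""
    scores = [logit + g for logit, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def softmax(logits):
    """Numerically stable softmax, used here only as a reference distribution."""
    m = max(logits)
    exps = [math.exp(logit - m) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)

# Hypothetical teacher logits for one position (illustrative values only).
teacher_logits = [2.0, 1.0, 0.1]

# With the noise held fixed, the chosen token is fully determined.
noise = [sample_gumbel(rng) for _ in teacher_logits]
token = gumbel_max(teacher_logits, noise)
token_again = gumbel_max(teacher_logits, noise)

# Over many independent noise draws, token frequencies approach softmax(logits).
num_draws = 20000
counts = [0] * len(teacher_logits)
for _ in range(num_draws):
    g = [sample_gumbel(rng) for _ in teacher_logits]
    counts[gumbel_max(teacher_logits, g)] += 1
freqs = [c / num_draws for c in counts]
target = softmax(teacher_logits)

print("token for fixed noise:", token)
print("empirical frequencies:", [round(f, 3) for f in freqs])
print("softmax probabilities:", [round(p, 3) for p in target])
```

In a distillation setting, one would presumably record each noise vector together with the token the teacher selects under it, giving a parallel student a fixed input/output pair to fit. That pairing step is an assumption about usage here, not a detail stated in the abstract.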