Learning from the Self-future: On-policy Self-distillation for dLLMs
2026-06-16 • Computation and Language
AI summary
The authors study how to improve diffusion large language models (dLLMs) by having the model learn from its own generated answers, in a way that fits how dLLMs differ from traditional autoregressive models. Existing self-distillation methods give the teacher extra information as a left-to-right prefix and supervise token by token. The authors instead append the model's own answer as a suffix and supervise at the level of denoising steps, which matches how dLLMs generate text. Their method, called d-OPSD, outperformed RLVR and SFT baselines on four reasoning benchmarks while needing only around 10% of the optimization steps required by RLVR. This suggests a promising direction for dLLM post-training.
on-policy self-distillation, diffusion large language models, autoregressive models, suffix conditioning, step-level supervision, iterative denoising, reinforcement learning with verifiable rewards, supervised fine-tuning, post-training
Authors
Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu
Abstract
On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric: they inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
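The two ideas in the abstract, suffix conditioning and step-level supervision, can be illustrated with a toy sketch. This is not the authors' implementation: the mask token, the helper names (`build_student_input`, `build_teacher_input`, `step_level_loss`), the hand-written probability tables, and the choice of KL(teacher || student) as the divergence are all assumptions made for illustration. It shows one plausible reading, in which the teacher is the same model with the self-generated answer appended after the masked response, and the loss is averaged over the positions decoded in a single denoising step.

```python
import math

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def build_student_input(prompt, response_len):
    """Student view: the prompt followed by a fully masked response."""
    return list(prompt) + [MASK] * response_len


def build_teacher_input(prompt, response_len, self_answer):
    """Teacher view (assumed layout): same as the student view, with the
    model's own generated answer appended as a suffix."""
    return list(prompt) + [MASK] * response_len + list(self_answer)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def step_level_loss(teacher_probs, student_probs, step_positions):
    """Average KL(teacher || student) over the positions decoded in one
    denoising step. The divergence direction is an assumption."""
    if not step_positions:
        return 0.0
    total = sum(kl(teacher_probs[i], student_probs[i]) for i in step_positions)
    return total / len(step_positions)


prompt = ["What", "is", "2+3", "?"]
self_answer = ["Answer", ":", "5"]
student_in = build_student_input(prompt, 3)
teacher_in = build_teacher_input(prompt, 3, self_answer)

# Toy per-position distributions over a two-token vocabulary (made up).
teacher_probs = {4: [0.9, 0.1], 5: [0.2, 0.8], 6: [0.5, 0.5]}
student_probs = {4: [0.6, 0.4], 5: [0.2, 0.8], 6: [0.5, 0.5]}

# Suppose this denoising step decodes response positions 4 and 5.
loss = step_level_loss(teacher_probs, student_probs, [4, 5])
print(f"step-level loss: {loss:.4f}")
```

In a real training loop the two probability tables would come from forward passes of the same dLLM on `student_in` and `teacher_in`, and the set of positions would follow the model's own decoding order rather than a fixed list.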