Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

2026-04-14

Machine Learning · Artificial Intelligence · Computation and Language
AI summary

The authors studied how on-policy distillation (OPD), a method used to improve large language models after training, actually works. They found that OPD works well only if the teacher and student models think in similar ways and if the teacher provides new knowledge that the student hasn't seen before. They also discovered that successful OPD aligns the student and teacher on a small set of important tokens that the student frequently encounters. To fix cases where OPD fails, the authors suggest two strategies: starting with off-policy data and carefully choosing prompts that match the teacher. They also highlight that OPD's benefits come with costs, especially when trying to apply it over long sequences.

Keywords

on-policy distillation, large language models, teacher-student models, training dynamics, token alignment, reverse distillation, prompt selection, off-policy learning, post-training, long-horizon distillation
Authors
Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
Abstract
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states: a small shared set of tokens concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent "free lunch" of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
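As a rough illustration of the mechanism the abstract describes, here is a minimal sketch of the dense token-level signal OPD provides: the student samples a response, and each student-visited token position receives a per-token reverse KL between student and teacher next-token distributions. This is a generic sketch, not the paper's implementation; the array names and toy shapes are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def reverse_kl_per_token(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher), computed at states the
    student itself visited -- the dense token-level reward OPD relies on."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)

# Toy example: 4 student-visited token positions, vocabulary of 5.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 5))
teacher_logits = rng.normal(size=(4, 5))
kl = reverse_kl_per_token(student_logits, teacher_logits)
print(kl.shape)  # one scalar signal per token position
```

Minimizing this quantity over student-sampled trajectories pushes the student toward the teacher exactly on the high-probability tokens at its own visited states, consistent with the alignment mechanism the paper reports.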