On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
2026-06-24 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors studied a method called on-policy self-distillation where a model teaches itself using its own correct examples. They found that although this helps the model improve accuracy for the best guess, it reduces the variety in its answers and limits gains from generating more attempts. Their analysis shows this happens because the model's feedback reinforces its existing guesses rather than encouraging diverse solutions. Testing on tasks like path-finding and science questions, the authors saw that while self-distilled models perform well on average, they struggle when they need to think of new or different strategies.
on-policy self-distillation, pass@1 accuracy, rollout diversity, token-level feedback, reinforcement learning, conditional mutual information, functional diversity, semantic diversity, out-of-distribution generalization, model biases
Authors
Andrei Liviu Nicolicioiu, Mohammad Pezeshki, Aaron Courville
Abstract
On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this could come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student rollout while conditioned on a sampled correct rollout, channeling its feedback through the model's own biases. We theoretically analyze the optimal self-distillation policy and show that it tilts the base distribution by a pointwise conditional mutual information score between the student's rollout and the correct rollout used as context. Unlike the ideal optimal on-policy reinforcement learning (RL), which preserves probability ratios among equally correct rollouts, self-distillation can amplify existing probability gaps, concentrating mass on already-dominant modes. On a controlled graph path-finding task and science question-answering benchmarks, self-distilled models match or exceed RL on average performance but exhibit substantially lower functional and semantic diversity, failing on out-of-distribution settings that require diverse strategies.
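To make the pass@1 versus pass@k distinction in the abstract concrete, the sketch below computes the standard unbiased pass@k estimate from n sampled rollouts of which c are correct. The function names and the per-problem counts are hypothetical and chosen only for illustration; they are not results from the paper. The toy numbers show how a policy that concentrates its mass can score higher at k = 1 yet gain nothing from additional rollouts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k rollouts,
    drawn without replacement from n samples with c correct, is correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n = 10 rollouts on four problems (made-up data).
N = 10
diverse = [5, 5, 5, 5]       # every problem is solved about half the time
collapsed = [10, 10, 10, 0]  # three problems always solved, one never solved

for k in (1, 5, 10):
    print(k, round(mean_pass_at_k(diverse, N, k), 3),
          round(mean_pass_at_k(collapsed, N, k), 3))
```

In this toy setting the collapsed policy has the higher pass@1 but its curve stays flat as k grows, while the diverse policy approaches 1. This is only an illustration of the flattening pattern the abstract describes, not a model of how self-distillation produces it.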