The Role of Feedback Alignment in Self-Distillation
2026-06-09 • Artificial Intelligence
Artificial Intelligence, Machine Learning
AI summary
The authors explored how to best design feedback for improving language models through self-distillation, a training method that helps models keep improvements even without extra context. They compared three types of feedback: simple rewards, full reference answers, and detailed step-by-step critiques aligned with the model's own reasoning. Their results showed that giving feedback matched to the model’s reasoning steps works best, because it fixes only the mistakes without changing the parts that are already correct. This means that aligning feedback structure with how the model thinks is important for better learning.
language model, self-distillation, context conditioning, feedback, step-by-step critique, reference solution, reward signal, reasoning trace, model training
Authors
Semih Kara, Oğuzhan Ersoy
Abstract
Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.
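The per-token argument in the abstract can be illustrated with a small toy sketch. The abstract does not give the exact definition of the per-token advantage, so the code below assumes one plausible form: the log-probability the self-teacher assigns to each sampled token minus the log-probability the student assigns to it. The function names, the three-token vocabulary, and all logit values are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_advantage(student_logits, teacher_logits, token_ids):
    """ASSUMED definition: log p_teacher(token) - log p_student(token),
    evaluated at each token the student actually sampled."""
    advantages = []
    for s, t, tok in zip(student_logits, teacher_logits, token_ids):
        advantages.append(log_softmax(t)[tok] - log_softmax(s)[tok])
    return advantages


# Hypothetical 3-token response over a 3-word vocabulary.
# The student always picked token 0; suppose only position 1 is a reasoning error.
tokens = [0, 0, 0]
student = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]

# Step-aligned critique (toy): the self-teacher agrees with the student
# on the correct steps and shifts probability only at the faulty step.
step_aligned_teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 0.0, 0.0]]

# Reference solution (toy): the self-teacher prefers a different phrasing
# everywhere, so its distribution differs even at the correct steps.
reference_teacher = [[1.0, 1.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0, 1.0]]

adv_step = per_token_advantage(student, step_aligned_teacher, tokens)
adv_ref = per_token_advantage(student, reference_teacher, tokens)

print("step-aligned:", [round(a, 3) for a in adv_step])
print("reference:   ", [round(a, 3) for a in adv_ref])
```

Under these toy numbers, the step-aligned self-teacher produces a nonzero signal only at the faulty position, while the reference-conditioned self-teacher produces a nonzero signal at every position, including the two correct ones. This mirrors the qualitative claim in the abstract; it is not a reproduction of the paper's analysis or training objective.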