When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

2026-06-02

Machine Learning, Artificial Intelligence
AI summary

The authors studied how the timing of teacher updates relative to the student affects learning stability in self on-policy distillation. They found that "isolation periods," during which the teacher stays completely frozen between updates, are what prevents learning collapse, rather than the teacher's age. They identified a failure mode called "state-oblivious collapse," in which a fixed schedule that performs best over short runs fails over long ones because a clock-driven refresh can copy a temporarily drifting student into the teacher in one irreversible step. To fix this, they designed Consolidation-Gated Teacher Refresh (CGTR), which keeps the isolation periods but refreshes the teacher only when there is joint evidence of reward improvement and length-tail safety. With a single shared parameter set and no per-dataset retuning, CGTR avoided collapse and achieved the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse).

self on-policy distillation, teacher-student models, temporal coupling, isolation periods, KL divergence, state-oblivious collapse, EMA (Exponential Moving Average), Consolidation-Gated Teacher Refresh (CGTR), reinforcement learning, policy update schedule
Authors
Haowei Guo, Baolong Bi, Ruicheng Zhang, Bingqian Sun, Wentao Zhang
Abstract
Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule, which governs the temporal coupling between teacher and student, has not been systematically studied as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that isolation periods, defined as complete teacher freezing between updates, are the key structural property enabling stable learning, not teacher age. To characterize these underlying training dynamics, we introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. This framework further uncovers state-oblivious collapse: optimal short-horizon fixed schedules catastrophically fail under long-horizon training because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and mechanistically distinct from EMA's chronic contamination. To address this, we propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, ensuring every teacher movement responds to genuine student consolidation rather than a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task's learning dynamics.
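
The three kinds of teacher schedule contrasted in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the gate's exact criteria, so the names `GatedRefresh`, `min_reward_gain`, `max_tail_length`, and `tail_quantile`, their default values, the use of a response-length quantile as the "length-tail" signal, and the toy dictionary-of-floats parameters are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Params = Dict[str, float]  # toy stand-in for model weights (assumption)


def ema_update(teacher: Params, student: Params, decay: float = 0.99) -> Params:
    """EMA teacher: moves slightly every step, so there is no isolation period."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def clock_refresh(teacher: Params, student: Params, step: int, period: int) -> Params:
    """Fixed schedule: frozen between refreshes, hard copy every `period` steps."""
    if step > 0 and step % period == 0:
        return dict(student)  # copies the student regardless of its state
    return teacher


def length_tail(lengths: List[int], q: float) -> int:
    """Simple empirical upper quantile of response lengths."""
    ordered = sorted(lengths)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


@dataclass
class GatedRefresh:
    """Hypothetical gate: refresh only on reward gain AND a safe length tail."""
    baseline_reward: float          # reward at the last refresh
    min_reward_gain: float = 0.02   # hypothetical threshold
    max_tail_length: int = 4096     # hypothetical threshold, in tokens
    tail_quantile: float = 0.95     # hypothetical choice of tail statistic

    def maybe_refresh(
        self, teacher: Params, student: Params,
        mean_reward: float, lengths: List[int],
    ) -> Tuple[Params, bool]:
        improved = mean_reward - self.baseline_reward >= self.min_reward_gain
        safe = length_tail(lengths, self.tail_quantile) <= self.max_tail_length
        if improved and safe:
            self.baseline_reward = mean_reward
            return dict(student), True   # teacher moves to the student
        return teacher, False            # teacher stays frozen (isolation)
```

In this sketch the clock-driven schedule copies the student whenever the step counter says so, while the gated version leaves the teacher frozen if, for example, the reward has risen but a few responses have grown very long. Under these assumptions, that is the kind of transient drift the gate is meant to keep out of the teacher.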