Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
2026-06-11 • Machine Learning
AI summary
The authors studied how on-policy distillation (OPD), a method that combines trajectories sampled from the student with dense teacher supervision, changes a model’s parameters. They found that OPD updates are small, sparse at the level of individual coordinates, spread across layers, and usually concentrated in feed-forward networks, and that training only the sparse subnetwork that gets updated nearly matches the full method’s performance. They also showed that the parameter changes have distinctive geometric patterns: they are numerically full-rank but spectrally concentrated, lie mostly outside the principal singular subspaces of the original weights, and fall disproportionately on weights close to zero. OPD therefore does not simply rewrite parameters densely. Finally, the authors noted that AdamW outperforms a sparsity-inducing SGD optimizer in their ablation, likely because dense teacher supervision preserves gradient scales that vary from coordinate to coordinate.
on-policy distillation, parameter sparsity, feed-forward networks, student trajectories, teacher supervision, AdamW optimizer, stochastic gradient descent, spectral analysis, model fine-tuning, parameter geometry
Authors
Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye
Abstract
On-policy distillation (OPD) has recently become a prominent post-training recipe because it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. Yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse; they are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales, for which AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
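The sparsity and geometry diagnostics named in the abstract can be illustrated with a small NumPy sketch for a single weight matrix. This is not the authors' code, and the paper's exact definitions may differ: the function names, the change tolerance `tol`, the spectral energy threshold `energy`, the subspace size `k`, and the near-zero `quantile` are all assumptions chosen for illustration.

```python
import numpy as np


def update_sparsity(w_src, w_new, tol=1e-5):
    """Fraction of coordinates whose absolute change is at most `tol` (assumed threshold)."""
    delta = w_new - w_src
    return float(np.mean(np.abs(delta) <= tol))


def spectral_stats(delta, energy=0.9):
    """Return (numerical rank, number of top singular directions needed to
    reach `energy` of the squared singular-value mass). Assumes delta != 0."""
    s = np.linalg.svd(delta, compute_uv=False)  # sorted in descending order
    rank = int(np.linalg.matrix_rank(delta))
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy)) + 1
    return rank, k


def principal_overlap(w_src, delta, k):
    """Share of the update's squared Frobenius norm that lies in the top-k
    left and right singular subspaces of the source weights."""
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    proj = uk @ (uk.T @ delta @ vk) @ vk.T
    return float(np.linalg.norm(proj) ** 2 / np.linalg.norm(delta) ** 2)


def near_zero_mass(w_src, delta, tol=1e-5, quantile=0.2):
    """Among changed coordinates, the fraction whose source weight magnitude
    falls in the bottom `quantile` of |w_src|. Returns nan if nothing changed."""
    changed = np.abs(delta) > tol
    if not changed.any():
        return float("nan")
    thresh = np.quantile(np.abs(w_src), quantile)
    return float(np.mean(np.abs(w_src)[changed] <= thresh))


if __name__ == "__main__":
    # Synthetic stand-in for a (source, post-trained) weight pair; not real model data.
    rng = np.random.default_rng(0)
    w_src = rng.normal(size=(64, 64))
    mask = rng.random((64, 64)) < 0.05
    delta = mask * rng.normal(scale=1e-2, size=(64, 64))
    w_new = w_src + delta
    print("sparsity:", update_sparsity(w_src, w_new))
    print("rank, k@90% energy:", spectral_stats(delta))
    print("overlap with top-8 subspace:", principal_overlap(w_src, delta, k=8))
    print("near-zero share:", near_zero_mass(w_src, delta))
```

Under these assumed definitions, the pattern the abstract describes would show up as a high `update_sparsity`, a numerical rank near the matrix dimension alongside a much smaller energy count, a low `principal_overlap`, and a `near_zero_mass` well above the chosen quantile.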