Physics-Guided Policy Optimization with Self-Distillation

2026-06-02

Machine Learning, Artificial Intelligence
AI summary

The authors study self-distilled policy optimization (SDPO), a post-training method in which a large language model learns from its own predictions. They note that SDPO can become unstable because it applies every update with the same step size, even when the self-teacher's correction is unreliable. To address this, they propose Physics-Guided Policy Optimization (PGPO), which borrows ideas from fluid dynamics and scales each learning step by an estimate of how much useful information the update carries. In their tests on a science question-answering task, PGPO outperforms SDPO on 3 of 4 domains and remains stable in a setting where SDPO collapses late in training.

Self-distilled policy optimization, Large language models, Post-training, Mutual information, Step size modulation, Stochastic differential equations, Vanilla SGD, Science-QA dataset, Training stability, Fluid dynamics analogy
Authors
Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei
Abstract
Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-teacher can be highly informative on some batches and misleading on others, and applying them uniformly with a fixed step size can destabilize training. Drawing inspiration from viscous-fluid dynamics and formalizing the analogy at the SDE level, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla SGD and incurs negligible overhead per iteration. We evaluate PGPO on the Science-QA dataset, where it outperforms SDPO on 3 of the 4 domains with gains of up to +4.5 points, while remaining stable in a setting where SDPO collapses late in training.
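
To make the idea of an information-modulated step-size multiplier concrete, here is a minimal Python sketch. It is not the authors' formulation: the abstract does not give the estimator or the mapping, so everything below is an assumption chosen for illustration. The sketch uses a plug-in mutual-information estimate over paired discrete predictions (the student's and the teacher's predicted labels on a batch), normalizes it by log K, and maps it linearly into a multiplier in [floor, 1]. The function names, the `floor` parameter, and the choice that higher mutual information yields a larger step are all hypothetical.

```python
import numpy as np


def plugin_mutual_information(student_labels, teacher_labels, num_classes):
    """Plug-in estimate of I(S; T) in nats from paired discrete labels.

    Assumption: predictions are reduced to integer class labels in
    [0, num_classes). This is an illustrative estimator, not the paper's.
    """
    s = np.asarray(student_labels, dtype=np.intp)
    t = np.asarray(teacher_labels, dtype=np.intp)

    # Empirical joint distribution from co-occurrence counts.
    joint = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(joint, (s, t), 1.0)
    joint /= joint.sum()

    # Marginals and their outer product.
    p_s = joint.sum(axis=1, keepdims=True)
    p_t = joint.sum(axis=0, keepdims=True)
    product = p_s @ p_t

    # Sum only over cells with positive mass to avoid log(0).
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


def step_size_multiplier(student_labels, teacher_labels, num_classes, floor=0.1):
    """Hypothetical map from the MI estimate to a multiplier in [floor, 1].

    Requires num_classes >= 2 so that log(num_classes) > 0.
    """
    mi = plugin_mutual_information(student_labels, teacher_labels, num_classes)
    normalized = mi / np.log(num_classes)          # roughly in [0, 1]
    normalized = min(max(normalized, 0.0), 1.0)    # clamp numerical drift
    return floor + (1.0 - floor) * normalized


# Usage: scale a base learning rate for one batch (values are illustrative).
base_lr = 1e-5
m = step_size_multiplier([0, 1, 2, 3], [0, 1, 2, 3], num_classes=4)
effective_lr = base_lr * m
```

Keeping the multiplier bounded in [floor, 1] means the effective step never exceeds the base step size in this sketch. Whether PGPO bounds its multiplier this way, and in which direction it responds to the mutual-information estimate, is not stated in the abstract.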