Reinforcement Learning from Rich Feedback with Distributional DAgger

2026-06-03 • Machine Learning

Machine Learning, Artificial Intelligence, Computation and Language

AI summary

The authors explore how to improve reasoning models by using richer feedback than just correct-or-incorrect rewards. They propose a new learning method called DistIL, which uses expert guidance more effectively by focusing on where the model disagrees with the expert and improving earlier decisions accordingly. Unlike previous methods, their approach guarantees steady policy improvements and better success rates. They demonstrate that DistIL works better than traditional reinforcement learning in tasks like scientific reasoning, coding, and math problems.

reinforcement learning, imitation learning, DAgger, cross-entropy, policy improvement, self-distillation, expert feedback, Pass@N, credit assignment

Authors

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

Abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon divergence fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
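
As a rough illustration of the forward cross-entropy objective mentioned in the abstract, the sketch below averages H(expert, student) over states taken from the student's own rollouts. It is a toy, per-state version only: the function names, the list-of-pairs data layout, and the two-action example are assumptions made for illustration, and the sequence-level gradient that propagates later disagreement back to earlier decisions is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_cross_entropy(expert_probs, student_logits):
    """H(expert, student) = -sum_a p_E(a|s) * log pi(a|s) at one state.

    Only the expert's probabilities are needed, so the expert can be
    queried as a black box (assumption: it returns a full distribution).
    """
    student_probs = softmax(student_logits)
    return -sum(
        p * math.log(q)
        for p, q in zip(expert_probs, student_probs)
        if p > 0.0  # terms with zero expert mass contribute nothing
    )


def on_policy_loss(visited_states):
    """Average forward cross-entropy over states visited by the student.

    `visited_states` is a hypothetical list of
    (expert_probs, student_logits) pairs collected on student rollouts.
    """
    losses = [forward_cross_entropy(e, s) for e, s in visited_states]
    return sum(losses) / len(losses)


# Toy example: two visited states, two actions each.
states = [
    ([1.0, 0.0], [0.0, 0.0]),  # expert prefers action 0, student is uniform
    ([0.5, 0.5], [0.0, 0.0]),  # expert and student are both uniform
]
loss = on_policy_loss(states)
```

In this toy setup, raising the student's logit on the expert-preferred action in the first state lowers the loss, which is the direction a gradient step on this objective would push.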
