LAD: Learning Advantage Distribution for Reasoning
2026-02-23 • Machine Learning
AI summary
The authors point out that current reinforcement learning methods for large models focus too narrowly on maximizing expected reward, which can make models ignore other valid ways to solve a problem. They propose a new method called Learning Advantage Distributions (LAD) that instead matches a distribution over different good answers rather than concentrating on a single best one. This approach encourages the model to explore more diverse solutions and avoids collapse due to overconfidence, all without extra training cost. Their tests on math and coding tasks show that LAD improves both accuracy and the variety of model outputs.
reinforcement learning, expected rewards, advantage function, distribution matching, f-divergence, policy update, entropy regularization, large language models, multimodal distributions, generative diversity
Authors
Wendi Li, Sharon Li
Abstract
Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.
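To make the distribution-matching idea concrete, here is a minimal toy sketch in the bandit setting the abstract mentions. This is a hypothetical construction, not the paper's exact objective: it assumes group-normalized advantages (as in GRPO-style baselines), a softmax advantage-induced target with a hypothetical temperature `beta`, and a forward-KL matching step; the paper's actual f-divergence and update may differ.

```python
import numpy as np

def lad_toy_update(logits, rewards, lr=0.5, beta=1.0):
    """One LAD-style distribution-matching step for a softmax bandit policy.

    Hypothetical sketch: the target is a softmax over group-normalized
    advantages, and the policy is pulled toward it by a gradient step on
    KL(target || policy). `beta` is an assumed temperature parameter.
    """
    # Current policy distribution over arms
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Group-relative advantages (GRPO-style normalization)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Advantage-induced target distribution
    target = np.exp(beta * adv)
    target /= target.sum()
    # Gradient of KL(target || policy) w.r.t. logits is (probs - target)
    logits = logits - lr * (probs - target)
    return logits, target

# Two equally good arms: the target is bimodal, and matching it keeps
# probability mass on both instead of collapsing onto one.
logits = np.zeros(3)
rewards = np.array([1.0, 1.0, 0.0])
for _ in range(200):
    logits, target = lad_toy_update(logits, rewards)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In this toy example the policy converges to the multimodal target rather than a single arm, which mirrors the abstract's claim that matching the advantage-induced distribution preserves alternative high-advantage responses instead of collapsing onto the dominant one.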