From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

2026-04-15 · Machine Learning

Machine Learning · Artificial Intelligence · Computation and Language
AI summary

The authors address a limitation of standard reinforcement learning for large language models: it only adjusts outputs given inputs (P(y|x)), while the model's overall output distribution (P(y)) bounds what can be achieved. They introduce PreRL, which applies reward-driven online updates directly to P(y), and show that its gradients align well with those of standard reinforcement learning. A key component of their approach, Negative Sample Reinforcement, quickly prunes wrong answers and stimulates reflective reasoning. Building on this, they propose Dual Space RL, which first uses PreRL to broaden the model's reasoning ability and then fine-tunes with standard reinforcement learning, yielding better performance in experiments.
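The summary above says Negative Sample Reinforcement "quickly eliminates wrong answers." As a toy, hypothetical sketch of that idea (not the authors' implementation), the snippet below runs a REINFORCE-style update on a 4-way softmax "policy" but only on incorrectly sampled answers, each with reward −1; correct samples trigger no update. Pushing probability mass off wrong answers still concentrates mass on the correct one, even though it is never directly rewarded. All numbers (learning rate, step count, answer set) are made up for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
logits = [0.0, 0.0, 0.0, 0.0]  # 4 candidate answers, index 0 is "correct"
correct = 0
lr = 0.5

for _ in range(200):
    probs = softmax(logits)
    y = random.choices(range(4), weights=probs)[0]
    if y != correct:
        # Negative-sample-only REINFORCE: reward -1 on the wrong answer y.
        # grad of log pi(y) w.r.t. logit k is (1[k==y] - probs[k]);
        # multiplying by reward -1 pushes mass off y, onto the rest.
        for k in range(4):
            grad_logp = (1.0 if k == y else 0.0) - probs[k]
            logits[k] += lr * (-1.0) * grad_logp

# Probability of the correct answer, typically close to 1 after training
print(softmax(logits)[correct])
```

The design point is that the update never references which answer is correct, only which sampled answers are wrong; because the softmax renormalizes, suppressing wrong answers is enough to steer the policy.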

Reinforcement Learning (RL) · Large Language Models (LLMs) · Pre-train Space · P(y|x) Conditional Distribution · P(y) Marginal Distribution · Negative Sample Reinforcement (NSR) · Policy Reincarnation · Dual Space RL (DSRL) · Gradient Alignment · Reasoning Enhancement
Authors
Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
Abstract
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
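The abstract's central claim of "strong gradient alignment between log P(y) and log P(y|x)" can be checked in a toy categorical model (a hypothetical sketch, not the paper's setup): with a logit table theta[x][y], P(y|x) = softmax(theta[x])[y] and P(y) = Σ_x P(x) P(y|x). The analytic gradient of log P(y*|x*) w.r.t. the block theta[x*] is e_{y*} − softmax(theta[x*]), and a finite-difference gradient of log P(y*) over the same block turns out to be a positive rescaling of it, so their cosine similarity is 1.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_p_marginal(theta, p_x, y):
    """log P(y) = log sum_x P(x) * softmax(theta[x])[y]."""
    return math.log(sum(px * softmax(row)[y] for px, row in zip(p_x, theta)))

# Toy model (made-up numbers): 2 prompts x, 3 answers y, logit table theta[x][y]
theta = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]
p_x = [0.6, 0.4]           # prompt distribution
x_star, y_star = 0, 2      # the (prompt, answer) pair we inspect

# Analytic gradient of log P(y*|x*) w.r.t. theta[x_star]
probs = softmax(theta[x_star])
grad_cond = [(1.0 if k == y_star else 0.0) - probs[k] for k in range(3)]

# Numerical gradient of log P(y*) w.r.t. the same logits (central differences)
eps = 1e-6
grad_marg = []
for k in range(3):
    theta[x_star][k] += eps
    hi = log_p_marginal(theta, p_x, y_star)
    theta[x_star][k] -= 2 * eps
    lo = log_p_marginal(theta, p_x, y_star)
    theta[x_star][k] += eps
    grad_marg.append((hi - lo) / (2 * eps))

dot = sum(a * b for a, b in zip(grad_cond, grad_marg))
cos = dot / (math.sqrt(sum(a * a for a in grad_cond))
             * math.sqrt(sum(b * b for b in grad_marg)))
print(round(cos, 6))  # → 1.0: the two gradients point the same way
```

In this toy case the alignment is exact because ∇ log P(y*) restricted to theta[x*] equals [P(x*)P(y*|x*)/P(y*)] · ∇ log P(y*|x*), a positive scalar multiple; the paper's claim concerns the analogous (empirical, approximate) alignment in full LLMs.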