When Can LLMs Learn to Reason with Weak Supervision?

2026-04-20 · Machine Learning

Machine Learning · Artificial Intelligence
AI summary

The authors studied how large language models learn reasoning tasks when given weaker or less reliable reward signals. They found that models which improve steadily through a prolonged pre-saturation phase tend to generalize, while models whose training reward saturates quickly tend to memorize rather than learn. The pre-RL property that predicts which regime a model falls into is reasoning faithfulness: how logically connected a model’s intermediate reasoning steps are to its final answer. They also showed that supervised fine-tuning on explicit reasoning traces is necessary for generalization under weak supervision, and that continual pre-training on domain data amplifies the effect. Applied to Llama3.2-3B-Base, these interventions enabled generalization in settings where the base model previously failed.

Tags
large language models, reinforcement learning, reward signals, generalization, reward saturation, reasoning faithfulness, supervised fine-tuning, pre-training, noisy rewards, weak supervision
Authors
Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, Pavel Izmailov
Abstract
Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
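To make the "noisy rewards" setting concrete: in RLVR, a verifier scores a model's final answer against a reference, and weak supervision can be simulated by randomly flipping that binary judgment. The sketch below is not the authors' implementation; it is a minimal toy illustration, and the function names (`verifiable_reward`, `noisy_reward`) and the flip-probability parameter are hypothetical choices for exposition.

```python
import random
from typing import Optional

def verifiable_reward(answer: str, reference: str) -> float:
    """Clean verifiable reward: 1.0 on exact match of the final answer,
    0.0 otherwise (a toy stand-in for a real verifier)."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def noisy_reward(answer: str, reference: str,
                 flip_prob: float = 0.2,
                 rng: Optional[random.Random] = None) -> float:
    """Weak supervision via label noise: with probability flip_prob,
    the verifier's binary judgment is flipped before being returned."""
    rng = rng or random.Random()
    r = verifiable_reward(answer, reference)
    if rng.random() < flip_prob:
        r = 1.0 - r  # corrupt the signal: correct answers can be penalized
    return r
```

Under this kind of corruption, the training reward alone no longer distinguishes learning from memorization, which is why the abstract's saturation-dynamics diagnostic matters.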