LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
2026-03-02 • Computation and Language
AI summary
The authors show that teaching large language models to reason well over long text is hard when the only feedback comes from the final answer. They prove that rewarding only the answer makes it difficult for the model to learn which parts of the text to attend to. To fix this, they introduce a method called LongRLVR that also rewards the model for selecting the right pieces of information from the text. This helps the model learn better and perform significantly better on long-context benchmarks. Their approach improves models such as Qwen and LLaMA by guiding them to use evidence more effectively.
Reinforcement Learning, Large Language Models, Reward Sparsity, Context Grounding, Vanishing Gradient, Long-Context Learning, Qwen Model, LLaMA Model, RL with Verifiable Rewards, LongBench
Authors
Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding, the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model in identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context-grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR, which augments the sparse answer reward with a dense, verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that resolves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
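The abstract does not spell out the reward formulation, but the core idea, adding a dense, verifiable grounding reward to the sparse answer reward, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the exact-match answer check, the recall-style overlap score over cited evidence, and the weighting coefficient `lam` are all illustrative assumptions.

```python
# Minimal sketch of the reward shaping idea from the abstract:
# a sparse, verifiable answer reward plus a dense context-grounding reward.
# All names and the specific scoring choices here are assumptions.

def answer_reward(predicted_answer: str, gold_answer: str) -> float:
    """Sparse outcome reward: 1.0 only if the final answer is verifiably correct."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

def context_reward(cited_evidence: set[str], gold_evidence: set[str]) -> float:
    """Dense grounding reward: fraction of gold evidence pieces the model cited.

    Unlike the answer reward, this gives partial credit, so the policy still
    receives a learning signal even when the final answer is wrong.
    """
    if not gold_evidence:
        return 0.0
    return len(cited_evidence & gold_evidence) / len(gold_evidence)

def combined_reward(predicted_answer: str, gold_answer: str,
                    cited_evidence: set[str], gold_evidence: set[str],
                    lam: float = 0.5) -> float:
    """Total reward: outcome term plus a weighted grounding term.

    lam is a hypothetical mixing coefficient; the paper may weight or
    combine the two signals differently.
    """
    return (answer_reward(predicted_answer, gold_answer)
            + lam * context_reward(cited_evidence, gold_evidence))

# Example: wrong final answer, but two of three gold evidence spans cited,
# so the rollout still earns a nonzero (dense) reward.
print(combined_reward("42", "43", {"p3", "p7"}, {"p3", "p7", "p9"}))  # 0.333...
```

Under these assumptions, the grounding term supplies gradient signal on rollouts whose final answer fails verification, which is exactly the sparsity problem the abstract says outcome-only RLVR suffers from.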