Context-Aware RL for Agentic and Multimodal LLMs

2026-06-15 • Computation and Language

Computation and Language • Computer Vision and Pattern Recognition

AI summary

The authors observe that large language models often fail when the answer depends on a small but decisive piece of evidence buried in a long or complex context. To address this, they propose ContextRL, which trains models not only to give the right answer but also to identify which context supports that answer. They apply the approach to coding-agent and image tasks by showing the model a query, an answer, and two highly similar contexts, and rewarding it for choosing the context that supports the answer. ContextRL improves on standard GRPO by an average of +2.2% on 5 long-horizon benchmarks and +1.8% across 12 visual question answering benchmarks. Baselines that reuse the same contrastive data as ordinary training examples give little to no improvement, which indicates that the gains come from the new training objective rather than from the extra data.

Large Language Models (LLMs) • Reinforcement Learning • Context-aware Learning • Contrastive Learning • Multimodal Reasoning • Long-horizon Reasoning • Visual Question Answering • Data Augmentation • Auxiliary Objective

Authors

Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu

Abstract

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
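
The context-selection objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the prompt format, the reward values, or how the auxiliary reward is combined with the answer reward, so everything below is an assumption made for illustration: the `ContrastivePair` class, the A/B prompt layout, the binary reward, and the helper names are all hypothetical. Contexts are plain strings here, whereas the paper uses trajectories and images. The advantage function follows the usual GRPO recipe of normalizing rewards within a group of sampled responses.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class ContrastivePair:
    """Hypothetical container for one contrastive example."""
    query: str
    answer: str
    supporting_context: str   # context that supports the query-answer pair
    distractor_context: str   # highly similar context that does not


def build_selection_prompt(pair, rng):
    """Return (prompt, gold_label) with the two contexts in random order.

    Shuffling prevents the model from learning a positional shortcut.
    The prompt wording is an assumption, not the paper's template.
    """
    contexts = [("supporting", pair.supporting_context),
                ("distractor", pair.distractor_context)]
    rng.shuffle(contexts)
    gold = "A" if contexts[0][0] == "supporting" else "B"
    prompt = (
        f"Query: {pair.query}\n"
        f"Answer: {pair.answer}\n"
        f"Context A: {contexts[0][1]}\n"
        f"Context B: {contexts[1][1]}\n"
        "Which context supports the answer? Reply with A or B."
    )
    return prompt, gold


def selection_reward(response, gold):
    """Assumed binary reward: 1.0 if the reply names the gold context."""
    return 1.0 if response.strip().upper() == gold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: score a group of sampled responses for one contrastive pair.
example = ContrastivePair(
    query="Why did the test run fail?",
    answer="A missing import in utils.py",
    supporting_context="trace: ImportError raised in utils.py",
    distractor_context="trace: all imports resolved, timeout in runner",
)
prompt, gold = build_selection_prompt(example, random.Random(0))
sampled_responses = ["A", "B", "A", "A"]  # stand-ins for model outputs
rewards = [selection_reward(r, gold) for r in sampled_responses]
advantages = group_relative_advantages(rewards)
```

Responses that pick the supporting context receive a positive advantage relative to the group, and those that pick the distractor receive a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.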
