ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

2026-07-02 · Artificial Intelligence

AI summary

The authors identify a problem: large language models (LLMs) can read long texts but often fail to use key information that is already present in the input when answering questions. They propose RECONTEXT, a training-free inference method that has the model select the most relevant parts of the input and replay them before answering, without extra training, external memory, or pruning of the context. The method treats the input as a memory store and uses model-internal relevance signals to find and replay important evidence. Their experiments on eight long-context datasets at 128K context length show that RECONTEXT improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B. The code is publicly available.

large language models, long-context reasoning, recursive evidence replay, inference method, associative memory, context window, attention mechanism, evidence utilization
Authors
Yanjun Zhao, Ruizhong Qiu, Tianxin Wei, Yuanchen Bei, Zhining Liu, Lingjie Chen, Ismini Lourentzou, Hanghang Tong, Jingrui He
Abstract
Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer generation without training, external memory, or context pruning. We also provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context length show that RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones. Code is available at https://github.com/Yanjun-Zhao/ReContext.
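
The replay idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`overlap_scorer`, `select_evidence`, `build_replay_prompt`), the chunking into sentences, the parameters `k` and `rounds`, and the prompt layout are all hypothetical. In particular, the word-overlap scorer is only a stand-in for the model-internal relevance signals the paper uses, and the way the cue grows between rounds is one assumed reading of "recursive selection".

```python
import re
from typing import Callable, List, Sequence, Set

# Hypothetical scorer type: maps (chunks, cue) to one relevance score per chunk.
Scorer = Callable[[Sequence[str], str], List[float]]


def _words(text: str) -> Set[str]:
    """Lowercased word set of a string, punctuation ignored."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_scorer(chunks: Sequence[str], cue: str) -> List[float]:
    """Toy stand-in for model-internal relevance signals (assumption):
    score each chunk by how many words it shares with the cue."""
    cue_words = _words(cue)
    return [float(len(cue_words & _words(c))) for c in chunks]


def select_evidence(chunks: Sequence[str], question: str, scorer: Scorer,
                    k: int = 1, rounds: int = 2) -> List[int]:
    """Build a query-conditioned evidence pool over several rounds.
    After each round the cue is extended with the evidence found so far,
    so later rounds can reach chunks related to earlier evidence."""
    pool: List[int] = []
    cue = question
    for _ in range(rounds):
        scores = scorer(chunks, cue)
        # Only chunks that are not yet in the pool are candidates.
        candidates = [i for i in range(len(chunks)) if i not in pool]
        candidates.sort(key=lambda i: scores[i], reverse=True)
        pool.extend(candidates[:k])
        cue = question + " " + " ".join(chunks[i] for i in pool)
    return sorted(pool)  # restore document order


def build_replay_prompt(chunks: Sequence[str], question: str,
                        pool: Sequence[int]) -> str:
    """Keep the full original context, then replay the selected evidence
    right before the question (no pruning of the context)."""
    context = "\n".join(chunks)
    evidence = "\n".join(chunks[i] for i in pool)
    return (f"{context}\n\nReplayed evidence:\n{evidence}\n\n"
            f"Question: {question}\nAnswer:")


chunks = [
    "Alice lives in Paris.",
    "The weather is mild.",
    "Paris is the capital of France.",
    "Bob likes tea.",
]
question = "Where does Alice live?"
pool = select_evidence(chunks, question, overlap_scorer, k=1, rounds=2)
prompt = build_replay_prompt(chunks, question, pool)
```

With this toy scorer, the first round picks the chunk that mentions Alice, and the second round, cued by that chunk, picks the chunk about Paris. The resulting prompt contains every original chunk once in the context and the two selected chunks a second time in the replayed section, which mirrors the abstract's point that evidence organization is separated from answer generation while the full context is preserved.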