Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
2026-02-12 • Computation and Language
AI summary
The authors study how large language models handle very long texts by compressing parts of the text into smaller sets of learned tokens. They identify a problem called 'token overflow,' where the compressed version loses information needed to answer questions correctly. They develop methods to detect this information loss and show that including the question itself makes such errors easier to identify. Their approach aims to catch compression failures before generation to improve overall model accuracy.
large language models, long-context processing, soft compression, token overflow, compressed tokens, query-aware detection, AUC-ROC, HotpotQA, SQuADv2, TriviaQA
Authors
Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko
Abstract
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend the effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet the limits of compressibility, and the point at which compression begins to erase task-relevant content, remain underexplored. In this paper, we define 'token overflow' as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens, but they show limited capability for detecting overflow. Lightweight probing classifiers over both query and context xRAG representations detect overflow with an average AUC-ROC of 0.72 across the HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
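For intuition, here is a minimal Python sketch of the general shape of the two detectors described in the abstract. It is not the paper's implementation: the saturation statistic, the embedding layout (`query_emb`, `ctx_emb`), and the synthetic data are all assumptions made for illustration; only the overall recipe, a query-agnostic per-representation statistic versus a query-aware linear probe scored with AUC-ROC, follows the abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: per-example query and compressed-context embeddings with a
# binary overflow label (1 = compression lost answer-relevant information).
n, d = 2000, 64
query_emb = rng.normal(size=(n, d))
ctx_emb = rng.normal(size=(n, d))
overflow = rng.integers(0, 2, size=n)


def saturation_fraction(emb: np.ndarray, tau: float = 0.95) -> np.ndarray:
    # Illustrative query-agnostic statistic (an assumption, not the paper's
    # definition): fraction of coordinates whose magnitude is within tau of
    # the example's peak magnitude. Statistics of this kind can separate
    # compressed from uncompressed representations without seeing the query.
    peak = np.abs(emb).max(axis=1, keepdims=True)
    return (np.abs(emb) >= tau * peak).mean(axis=1)


print("mean saturation:", saturation_fraction(ctx_emb).mean())

# Query-aware probe: concatenate query and context representations and fit a
# linear classifier. A probe this small is cheap to run before the LLM,
# which is what makes pre-generation gating practical.
X = np.concatenate([query_emb, ctx_emb], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, overflow, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC-ROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```

On real xRAG representations the probe would be trained on labeled examples of answerable versus overflowed contexts, and its score thresholded to decide which contexts are safe to pass to the LLM in compressed form.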