Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization
2026-02-16 • Machine Learning
Machine Learning • Artificial Intelligence
AI summary
The authors introduce a large benchmark called PAPerBench to study how longer contexts affect how well language models personalize responses and protect user privacy. They tested models with context lengths from 1K to 256K tokens and found that as the context grows, models do worse at both personalization and privacy protection. They explain this degradation with a theoretical account of attention dilution: with very long contexts, the models spread their attention too thinly over the important parts. Their results indicate that current models struggle to balance long-context understanding with focus, and they release the benchmark for others to use.
large language models, context length, personalization, privacy leakage, attention mechanism, Transformers, benchmark, attention dilution, scalability, evaluation
Authors
Shangding Gu
Abstract
Large language models (LLMs) are increasingly deployed in privacy-critical and personalization-oriented scenarios, yet the role of context length in shaping privacy leakage and personalization effectiveness remains largely unexplored. We introduce a large-scale benchmark, PAPerBench, to systematically study how increasing context length influences both personalization quality and privacy protection in LLMs. The benchmark comprises approximately 29,000 instances with context lengths ranging from 1K to 256K tokens, yielding a total of 377K evaluation questions. It jointly evaluates personalization performance and privacy risks across diverse scenarios, enabling controlled analysis of long-context model behavior. Extensive evaluations across state-of-the-art LLMs reveal consistent performance degradation in both personalization and privacy as context length increases. We further provide a theoretical analysis of attention dilution under context scaling, explaining this behavior as an inherent limitation of soft attention in fixed-capacity Transformers. The empirical and theoretical findings together suggest a general scaling gap in current models -- long context, less focus. We release the benchmark to support reproducible evaluation and future research on scalable privacy and personalization. Code and data are available at https://github.com/SafeRL-Lab/PAPerBench
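The attention-dilution intuition behind the abstract's theoretical claim can be illustrated with a toy calculation (this is an illustrative sketch, not the paper's formal analysis): if a query assigns a fixed score gap to one relevant token over n-1 distractor tokens, the softmax weight on the relevant token is e^Δ / (e^Δ + n - 1), which shrinks toward zero as the context length n grows. The function and parameter names below are hypothetical, chosen only for this demonstration.

```python
import numpy as np

def max_attention_weight(context_len, relevant_score=5.0, noise_score=0.0):
    """Softmax weight a single query assigns to one relevant token when the
    remaining (context_len - 1) tokens all receive a lower, uniform score.
    Toy model of attention dilution; not the paper's formal derivation."""
    scores = np.full(context_len, noise_score)
    scores[0] = relevant_score           # one relevant token, rest are distractors
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the whole context
    return weights[0]

# Weight on the relevant token decays roughly as 1/n with context length.
for n in [1_000, 8_000, 64_000, 256_000]:
    print(f"context {n:>7,} tokens -> weight on relevant token ~ {max_attention_weight(n):.5f}")
```

Under this toy model the mass assigned to any fixed set of relevant tokens vanishes as the context scales, which matches the qualitative "long context, less focus" gap the benchmark measures empirically.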