PolicyLong: Towards On-Policy Context Extension
2026-04-09 • Machine Learning
Machine Learning • Artificial Intelligence
AI summary
The authors identify a problem in training large language models (LLMs) to handle very long texts: current methods build the training data once, with a fixed model, so the data does not adapt as the model improves. They introduce PolicyLong, which repeatedly rebuilds the training data using the current model, so that both the helpful contexts and the hard distracting contexts reflect what the model currently finds informative. This creates a natural learning path that evolves alongside the model. Experiments show that this dynamic, on-policy approach outperforms previous fixed-data methods, with larger gains at longer contexts.
large language models, context window, long-range dependencies, entropy, on-policy learning, data screening, self-curriculum, training distribution, retrieval, predictive entropy
Authors
Junlong Jia, Ziyang Chen, Xing Wu, Chaochen Gao, TingHao Yu, Feng Zhang, Songlin Hu
Abstract
Extending LLM context windows is hindered by scarce high-quality long-context data. Recent methods synthesize data with genuine long-range dependencies via information-theoretic verification, selecting contexts that reduce a base model's predictive entropy. However, their single-pass offline construction with a fixed model creates a fundamental off-policy gap: the static screening landscape misaligns with the model's evolving capabilities, causing the training distribution to drift. We propose PolicyLong, shifting data construction towards a dynamic on-policy paradigm. By iteratively re-executing data screening (entropy computation, retrieval, and verification) using the current model, PolicyLong ensures the training distribution tracks evolving capabilities, yielding an emergent self-curriculum. Crucially, both positive and hard negative contexts derive from the current model's entropy landscape, co-evolving what the model learns to exploit and resist. Experiments on RULER, HELMET, and LongBench-v2 (Qwen2.5-3B) show PolicyLong consistently outperforms EntropyLong and NExtLong, with gains growing at longer contexts (e.g., +2.54 at 128K on RULER), confirming the value of on-policy data evolution.
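The screening loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Predict` interface, the `screen` and `on_policy_rounds` helpers, the `train` callback, and the `min_reduction` threshold are all hypothetical names and values introduced here for illustration, and retrieval is reduced to a fixed candidate list. The sketch only shows the idea that a context is kept when it lowers the model's predictive entropy, and that screening is re-run with the current model before each round.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the paper): given a context
# and a target text, return the model's next-token probability distribution.
Predict = Callable[[str, str], Sequence[float]]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def screen(predict: Predict, target: str, candidates: List[str],
           min_reduction: float = 0.1) -> List[str]:
    """Keep candidates that lower entropy on `target` by at least
    `min_reduction` (an assumed threshold, not a value from the paper)."""
    baseline = entropy(predict("", target))
    return [c for c in candidates
            if baseline - entropy(predict(c, target)) >= min_reduction]


def on_policy_rounds(predict: Predict,
                     train: Callable[[Predict, List[str]], Predict],
                     target: str, candidates: List[str],
                     rounds: int) -> List[List[str]]:
    """Re-run screening with the current model before every training round."""
    history = []
    for _ in range(rounds):
        kept = screen(predict, target, candidates)  # uses the current model
        history.append(kept)
        predict = train(predict, kept)  # updated model changes the next screen
    return history
```

An offline pipeline would correspond to calling `screen` once with the base model and reusing its output for every round. In the sketch, the set of kept contexts can change between rounds because `predict` is replaced after each training step.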