Efficient RL Training for LLMs with Experience Replay
2026-04-09 • Machine Learning
AI summary
The authors study experience replay, the reuse of previously generated rollouts, in the post-training of large language models (LLMs). The common assumption is that training needs freshly generated, on-policy data to reach high performance, but the authors show that this is not always true. They find that a well-designed replay buffer (a store of past rollouts) can substantially reduce the compute spent on generation without hurting the model's results, and in some cases improves them. They also explain how to balance the staleness of reused data against sample diversity and generation cost.
Experience Replay, Large Language Models, Post-Training, On-Policy Sampling, Replay Buffer, Sample Diversity, Policy Entropy, Computation Cost, Reinforcement Learning
Authors
Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Remi Munos
Abstract
While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
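To make the mechanism concrete, here is a minimal sketch of a replay buffer that stores rollouts and lets each one be reused a bounded number of times. It is an illustration under stated assumptions, not the authors' design: the names `Rollout`, `ReplayBuffer`, `max_staleness`, and `max_uses` are hypothetical, and FIFO eviction with uniform sampling is only one possible choice.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    # Hypothetical record; the fields are assumptions for illustration.
    prompt: str
    completion: str
    reward: float
    policy_step: int  # training step of the policy that generated this rollout
    uses: int = 0     # how many times it has been sampled for training


class ReplayBuffer:
    """FIFO buffer with a staleness limit and a per-rollout reuse limit."""

    def __init__(self, capacity, max_staleness, max_uses, seed=0):
        self.items = deque(maxlen=capacity)  # oldest rollouts drop out first
        self.max_staleness = max_staleness
        self.max_uses = max_uses
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.items)

    def add(self, rollout):
        self.items.append(rollout)

    def sample(self, batch_size, current_step):
        # Evict rollouts that are too old or have been reused too often.
        fresh = [
            r for r in self.items
            if current_step - r.policy_step <= self.max_staleness
            and r.uses < self.max_uses
        ]
        self.items = deque(fresh, maxlen=self.items.maxlen)
        # Sample without replacement; return fewer if the buffer runs short.
        batch = self.rng.sample(fresh, min(batch_size, len(fresh)))
        for r in batch:
            r.uses += 1
        return batch
```

In this sketch, a short batch from `sample` is the signal that new rollouts must be generated. Raising `max_staleness` or `max_uses` reduces how often that happens, at the price of training on older and less diverse data, which is the trade-off the abstract describes.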