Reinforced Fast Weights with Next-Sequence Prediction
2026-02-18 • Computation and Language
AI summary
The authors explain that fast weight models are a type of neural network that can handle very long inputs efficiently, but that they are usually trained to predict only the next single word, which keeps them from learning how longer stretches of text fit together. They propose a new training method called REFINE, which uses reinforcement learning to teach these models to predict whole sequences of words, improving how well they grasp longer contexts. REFINE picks informative positions in the text and rewards the model based on how well it predicts the groups of words that follow, leading to better overall performance. The authors show that this method works better than traditional training on several long-range tasks.
fast weight architectures, attention-based transformers, next-token prediction (NTP), reinforcement learning, next-sequence prediction (NSP), prediction entropy, multi-token rollouts, policy optimization, long-context modeling, pre-trained language models
Authors
Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky
Abstract
Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
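The abstract names four stages (entropy-based position selection, multi-token rollouts, self-supervised sequence-level rewards, GRPO) without giving details, so the sketch below is one hypothetical reading of that pipeline, not the authors' implementation. A ToyLM stands in for a fast weight model, token overlap with the ground-truth continuation stands in for the unspecified self-supervised reward, and GRPO's clipping and KL-regularization terms are omitted for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, SEQ, HORIZON, GROUP, K = 100, 32, 8, 4, 3  # toy sizes

# Toy stand-in for a fast weight language model: anything mapping token
# ids (B, T) -> next-token logits (B, T, VOCAB) fits this sketch.
class ToyLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 64)
        self.head = torch.nn.Linear(64, VOCAB)

    def forward(self, x):
        return self.head(self.emb(x))

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, VOCAB, (1, SEQ))  # one training sequence

# 1) Select informative positions: the K prefix positions whose
#    next-token distribution has the highest prediction entropy.
with torch.no_grad():
    logp = F.log_softmax(model(tokens)[0], dim=-1)      # (SEQ, VOCAB)
    entropy = -(logp.exp() * logp).sum(-1)              # (SEQ,)
    positions = entropy[: SEQ - HORIZON].topk(K).indices

# 2) Multi-token rollouts: sample GROUP continuations of HORIZON tokens
#    from each selected prefix.
def rollout(prefix):
    seq = prefix.clone()
    for _ in range(HORIZON):
        probs = F.softmax(model(seq)[:, -1], dim=-1)
        seq = torch.cat([seq, torch.multinomial(probs, 1)], dim=1)
    return seq[:, prefix.size(1):]                      # sampled part only

# 3) Self-supervised sequence-level reward. The paper's reward is not
#    spelled out in the abstract; token overlap with the ground-truth
#    continuation is used here purely as a stand-in.
def reward(sample, truth):
    return (sample == truth).float().mean(-1)           # (GROUP,)

losses = []
for p in positions.tolist():
    prefix = tokens[:, : p + 1].expand(GROUP, -1)
    truth = tokens[0, p + 1 : p + 1 + HORIZON].expand(GROUP, -1)
    with torch.no_grad():
        samples = rollout(prefix)                       # (GROUP, HORIZON)
        r = reward(samples, truth)
        adv = (r - r.mean()) / (r.std() + 1e-6)         # group-relative (GRPO)
    # 4) Policy gradient: re-score the sampled tokens with gradients on
    #    and weight their log-probs by the group-relative advantage.
    full = torch.cat([prefix, samples], dim=1)
    logits = model(full)[:, p:-1]                       # these predict `samples`
    tok_logp = F.log_softmax(logits, dim=-1).gather(
        -1, samples.unsqueeze(-1)).squeeze(-1)          # (GROUP, HORIZON)
    losses.append(-(adv.unsqueeze(-1) * tok_logp).mean())

loss = torch.stack(losses).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

The group-relative normalization is the distinctive piece of GRPO: because every rollout in a group shares the same prefix, standardizing rewards within the group supplies a baseline without training a separate value model, which is what makes a sequence-level RL objective practical on top of ordinary language model training.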