The State-Prediction Separation Hypothesis

2026-07-01, Computation and Language

Computation and Language, Artificial Intelligence, Machine Learning
AI summary

The authors propose that Transformers, which are models trained to predict the next token in a sequence, work better when the computation that stores information for future predictions is separated from the computation that predicts the next token. They built a Transformer variant with two separate streams for these roles and found that it is more data- and compute-efficient, reaching lower validation loss and scoring 2-3 percentage points higher on average on downstream tasks than standard Transformers. Their analysis rules out potential confounders as the source of these gains and shows how the design changes the gradients the model receives during training.

Transformer, language modeling, state prediction, next-token prediction, computation stream, pretraining, validation loss, downstream tasks, gradient, data efficiency
Authors
Giovanni Monea, Nathan Godey, Kianté Brantley, Yoav Artzi
Abstract
Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Transformers by 2-3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.