Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
2026-04-09 • Computation and Language
Computation and Language, Artificial Intelligence, Machine Learning
AI summary
The authors study implicit reasoning, in which a transformer combines multiple reasoning steps within a single forward pass. They find that vanilla transformers struggle to generalize to knowledge combinations never composed during training and to reasoning chains deeper than those seen in training. Recurrent-depth transformers, which repeatedly reuse the same layers, generalize better on both counts. However, too many iterations can cause the model to 'overthink' and make worse predictions. The authors also analyze how training strategies affect this ability and offer guidance on training these models.
implicit reasoning, transformers, compositional generalization, systematic generalization, depth extrapolation, recurrent-depth transformers, multi-hop reasoning, grokking, overthinking, inference-time recurrence
Authors
Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, Yuekun Yao
Abstract
We study implicit reasoning, i.e., the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges in the implicit reasoning setting: systematic generalization, i.e., combining knowledge that is never used in compositions during training, and depth extrapolation, i.e., generalizing from limited reasoning depth (e.g., training on up to 5-hop) to deeper compositions (e.g., 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both challenges, recurrent-depth transformers generalize effectively in both. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, a finding supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
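To make the idea of recurrence over shared layers concrete, the following is a minimal symbolic sketch, not the authors' model, data, or training setup. It assumes a hypothetical fact table `FACTS` and a hypothetical `shared_step` function standing in for the reused transformer block: each application resolves one hop of a multi-hop query, so the number of iterations chosen at inference time bounds the reasoning depth that can be reached.

```python
# Toy illustration only: a symbolic stand-in, NOT the paper's model or data.
# All names (FACTS, shared_step, recurrent_depth) are hypothetical.
from typing import Dict, List, Tuple

State = Tuple[str, List[str]]  # (current entity, relations still to apply)

# Hypothetical atomic facts: (entity, relation) -> entity.
FACTS: Dict[Tuple[str, str], str] = {
    ("a", "r1"): "b",
    ("b", "r2"): "c",
    ("c", "r1"): "d",
    ("d", "r3"): "e",
}


def shared_step(state: State) -> State:
    """One application of the shared block: resolve a single hop, if any remain."""
    entity, relations = state
    if not relations:
        return state  # nothing left to compose; the state is a fixed point
    return FACTS[(entity, relations[0])], relations[1:]


def recurrent_depth(entity: str, relations: List[str], num_iterations: int) -> State:
    """Apply the same step num_iterations times, mimicking weight-shared recurrence."""
    state: State = (entity, list(relations))
    for _ in range(num_iterations):
        state = shared_step(state)
    return state


if __name__ == "__main__":
    query = ["r1", "r2", "r1", "r3"]  # a 4-hop query starting from "a"
    for n in (2, 4, 6):
        print(n, recurrent_depth("a", query, n))
```

With two iterations the 4-hop query is left half-resolved, while four iterations complete it, mirroring the claim that more inference-time iterations enable deeper compositions. Unlike the learned models described in the abstract, this toy is unaffected by extra iterations (the state stops changing once all hops are resolved), so it does not reproduce the overthinking failure mode.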