Gradient Boosting within a Single Attention Layer

2026-04-03 • Machine Learning

Machine Learning • Artificial Intelligence

AI summary

The authors propose gradient-boosted attention, a modification to how attention works in transformers. Instead of computing attention once, their approach adds a second pass that attends to the error left by the first pass and applies a gated correction. The design is inspired by gradient boosting, a machine learning technique in which each new learner fits the residual error of the previous ones, so the model corrects its mistakes in a controlled way. In their tests on a 10M-token subset of WikiText-103, the method reaches lower test perplexity than standard attention, Twicing Attention, and a parameter-matched wider baseline.

Transformer, Attention mechanism, Gradient boosting, Softmax, Hopfield network, Residual learning, Squared reconstruction objective, Perplexity, WikiText-103, Twicing

Authors

Saleh Sargolzaei

Abstract

Transformer attention computes a single softmax-weighted average over values: a one-pass estimate that cannot correct its own errors. We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of 67.9 compared to 72.2 for standard attention, 69.6 for Twicing Attention, and 69.0 for a parameter-matched wider baseline, with two rounds capturing most of the benefit.
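The two-round construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single head with no causal mask, takes the layer input `X` as the reconstruction target (so the first-pass error is `X - y1`), computes the second pass's queries, keys, and values from that error, and uses a fixed per-dimension `gate` vector in place of a learned one. The names `attention`, `boosted_attention`, and `init_projections` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(h, Wq, Wk, Wv):
    """Single-head softmax self-attention over the rows of h (no mask)."""
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def boosted_attention(X, round1, round2, gate):
    """Two-round sketch: a first pass, then a gated correction fit to its error."""
    y1 = attention(X, *round1)
    # Assumption: the target is X itself, so the residual is the negative
    # gradient of 0.5 * ||X - y1||^2 with respect to y1.
    residual = X - y1
    # Assumption: the correction pass attends over the residual using its
    # own projections, separate from those of the first pass.
    correction = attention(residual, *round2)
    # The per-dimension gate plays the role of the shrinkage parameter.
    return y1 + gate * correction

def init_projections(rng, d):
    """Random (Wq, Wk, Wv), each d x d, standing in for learned weights."""
    return tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

rng = np.random.default_rng(0)
n, d = 5, 8                      # sequence length, model width
X = rng.normal(size=(n, d))
round1 = init_projections(rng, d)
round2 = init_projections(rng, d)
gate = np.full(d, 0.5)           # fixed here; learned in a real model
out = boosted_attention(X, round1, round2, gate)
print(out.shape)  # (5, 8)
```

Setting `gate` to zeros recovers the plain one-pass output, which is one way to see the gate acting as a shrinkage factor on the second base learner.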
