The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

2026-03-11 • Machine Learning


AI summary

The author studied how the feed-forward (MLP) layers of transformer language models such as GPT-2 Small decide whether a token needs nonlinear processing. The decision behaves like an on/off switch: the neurons involved act almost like binary signals, even though the data they route is continuous. This routing system develops in stages across the model's layers, and ablation experiments confirm that it matters for the model's perplexity. The work suggests that language models use binary decisions to steer tokens through different computational paths, which explains why smooth polynomial approximations struggle to mimic these layers.

transformer language models, MLP layers, binary neuron activations, GPT-2, token processing, neural routing, perplexity, nonlinear processing, piecewise-affine functions, consensus neurons

Authors

Peter Balogh

Abstract

We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture -- seven "default-ON" neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive -- creating a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms the routing is functional: removing the MLP at consensus breakdown increases perplexity by 43.3%, while removing it at full consensus increases perplexity by only 10.1%, a more than 4x difference. Comparing binary vs. continuous features for the routing decision confirms that binarization loses essentially no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22). This binary routing structure explains why smooth polynomial approximation fails: cross-validated polynomial fits (degrees 2-7) never exceed R^2 = 0.06 for highly nonlinear layers. We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.
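
The consensus/exception picture in the abstract can be illustrated with a minimal sketch on toy data. Everything here is an assumption made for illustration rather than the paper's actual procedure: the function names, the ON threshold of 0.0, the definition of mutual exclusivity as the fraction of tokens on which exactly one of two neurons is ON, and the made-up activation values for three hypothetical consensus neurons and one exception neuron.

```python
# Toy illustration of binarized routing neurons. All names, the threshold,
# and the activation values are hypothetical, not taken from the paper.

def binarize(acts, threshold=0.0):
    """Mark each activation as ON (True) if it exceeds the threshold."""
    return [a > threshold for a in acts]

def mutual_exclusivity(a_on, b_on):
    """Fraction of tokens on which exactly one of the two neurons is ON."""
    pairs = list(zip(a_on, b_on))
    return sum(a != b for a, b in pairs) / len(pairs)

def quorum_size(consensus_rows, threshold=0.0):
    """Per token, count how many consensus neurons are ON."""
    return [sum(binarize(row, threshold)) for row in consensus_rows]

# Rows are tokens, columns are three hypothetical consensus neurons.
consensus_acts = [
    [1.2, 0.8, 0.5],
    [0.9, 1.1, 0.7],
    [0.4, 0.6, 1.0],
    [-0.3, -0.2, -0.5],   # consensus breaks down on this token
    [-0.1, 0.3, -0.4],    # partial consensus
    [0.7, 0.2, 0.9],
]
# One hypothetical exception-handler neuron, one value per token.
exception_acts = [-0.5, -0.2, -0.8, 1.5, 0.9, -0.1]

exception_on = binarize(exception_acts)
quorum = quorum_size(consensus_acts)

# Exclusivity of each consensus neuron against the exception neuron.
exclusivity = [
    mutual_exclusivity(binarize([row[j] for row in consensus_acts]), exception_on)
    for j in range(3)
]

print("quorum per token:", quorum)
print("exclusivity per consensus neuron:", exclusivity)
```

In this toy data the exception neuron is ON exactly where the quorum collapses, which is the pattern the abstract describes as a binary routing switch. Real activations would come from hooks on the model's MLP layers, and the right threshold would depend on where in the layer the activations are read.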
