Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

2026-06-25, Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors study Top-k sparse autoencoders (SAEs), which decompose the activations of vision foundation models into sparse features by keeping only the k most active latents per input. They note that Top-k SAEs fix the number of active latents regardless of input complexity and tend to overfit to the value of k used in training. To address this, they add two soft sparsity penalties applied to the activations before the Top-k selection: an L1 penalty on the unselected units and a scale-invariant L1/L2-ratio penalty. Across two datasets, three vision foundation models, and a range of k, both penalties improve monosemanticity at no cost to reconstruction quality. They conclude that hard architectural sparsity and soft sparsity regularization are complementary rather than mutually exclusive.

Sparse autoencoder, Top-k sparsity, Sparsity regularization, Monosemantic features, Activation function, Vision foundation models, L1 penalty, L1/L2 ratio penalty, Overfitting, Linear probing
Authors
Nathanaël Jacquier, Maria Vakalopoulou, Mahdi S. Hosseini
Abstract
Sparse autoencoders (SAEs) have become a leading tool for interpreting the representations of vision foundation models, decomposing their polysemantic activations into a larger set of sparse, more monosemantic features. The Top-$k$ SAE, a now-standard variant, enforces sparsity architecturally through its activation function, retaining only the $k$ most active latents per input. Because it was designed precisely to avoid the $\ell_1$ penalty used by earlier SAEs and its known drawbacks, it has not been combined with an explicit sparsity regularizer, despite retaining limitations of its own, such as a budget $k$ that is fixed regardless of input complexity and a tendency to overfit to the training value of $k$. We introduce two sparsity regularizers compatible with the Top-$k$ architecture, both acting on the activations before the Top-$k$ selection: an $\ell_1$ penalty on the unselected (off-support) units, and a scale-invariant $\ell_1/\ell_2$-ratio penalty that concentrates the code onto fewer effective units. Both penalties are applied only to the batch-active units, those selected by the Top-$k$ operator at least once within the batch. Across two datasets, three vision foundation models, and a range of $k$, both regularizers consistently improve monosemanticity at no cost to reconstruction quality. The $\ell_1/\ell_2$ penalty further concentrates information into fewer latents, making reconstruction more robust to the inference-time choice of $k$ and improving small-budget linear probing. Our central finding is that hard architectural sparsity and soft sparsity regularization are complementary rather than mutually exclusive.
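
The two penalties described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`topk_support`, `sparsity_penalties`), the use of absolute values, the `eps` stabilizer, and the averaging over the batch are all assumptions made for illustration, and the abstract does not say how the penalties are weighted in the training loss. The sketch shows only the three ingredients the abstract names: the per-sample Top-$k$ support, the batch-active set (units selected at least once in the batch), and the two penalties computed on pre-selection activations restricted to that set.

```python
import math


def topk_support(pre, k):
    """Return the set of indices of the k largest pre-activations of one sample."""
    order = sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)
    return set(order[:k])


def sparsity_penalties(pre_batch, k, eps=1e-8):
    """Hypothetical sketch of the two regularizers on pre-Top-k activations.

    pre_batch: non-empty list of samples, each a list of pre-selection activations.
    Returns (off_support_l1, l1_l2_ratio), each averaged over the batch.
    """
    supports = [topk_support(pre, k) for pre in pre_batch]
    # Batch-active units: selected by Top-k for at least one sample in the batch.
    batch_active = sorted(set().union(*supports))

    off_l1 = 0.0
    ratio = 0.0
    for pre, support in zip(pre_batch, supports):
        # Off-support l1: magnitude of batch-active units this sample did NOT select.
        off_l1 += sum(abs(pre[j]) for j in batch_active if j not in support)
        # l1/l2 ratio over the batch-active units (assumed form; eps avoids 0/0).
        vals = [abs(pre[j]) for j in batch_active]
        l1 = sum(vals)
        l2 = math.sqrt(sum(v * v for v in vals))
        ratio += l1 / (l2 + eps)

    n = len(pre_batch)
    return off_l1 / n, ratio / n


# Toy batch: 2 samples, 4 latent units, budget k = 1.
pre_batch = [[3.0, 1.0, 0.0, 4.0],
             [0.0, 2.0, 0.0, 1.0]]
off_l1, ratio = sparsity_penalties(pre_batch, k=1)
print(off_l1, ratio)
```

In the toy batch, Top-1 selects unit 3 for the first sample and unit 1 for the second, so the batch-active set is {1, 3}. Unit 0 is never selected and therefore contributes to neither penalty, even though its activation in the first sample is large. Multiplying every activation by a constant scales the off-support $\ell_1$ term by that constant but leaves the $\ell_1/\ell_2$ ratio essentially unchanged, which is the scale invariance the abstract refers to.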