C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders

2026-06-29 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors study Sparse Autoencoders (SAEs), which decompose language model activations into interpretable features, and find that at large dictionary sizes these features are often split into fragments or absorbed by other features. They trace both problems to inconsistent latent assignment: the same concept is assigned to different latents from one sample to the next, which makes individual latents unreliable. To fix this, they propose C²R, a regularizer that encourages each feature to be represented by a single latent across the samples in a batch, reducing splitting and absorption. Their evaluation shows that the method improves feature interpretability without hurting reconstruction fidelity.

Sparse Autoencoder, Large Language Models, Feature Splitting, Feature Absorption, Latent Representation, Cross-sample Consistency, Regularization, Reconstruction Fidelity

Authors

Haoran Jin, Xiting Wang, Shijie Ren, Hong Xie, Defu Lian

Abstract

Sparse Autoencoders (SAEs) are widely used to interpret large language models by decomposing activations into sparse, human-understandable features, but scaling to large dictionaries exposes fundamental challenges. Systematic studies reveal pervasive feature splitting, which fragments coherent concepts into non-atomic latents, and widespread feature absorption, which creates arbitrary exceptions in general features; both severely compromise latent reliability. These issues stem from inconsistent latent assignment across samples: without cross-sample constraints, per-sample optimization can distribute a single underlying concept inconsistently across multiple redundant or interfering latents. To address this, we introduce C$^{2}$R (Cross-sample Consistency Regularization). C$^{2}$R explicitly encourages each semantic feature to be represented consistently by a single latent across the batch by penalizing the co-activation of directionally similar latents. Comprehensive evaluation demonstrates that C$^{2}$R effectively mitigates both splitting and absorption while, crucially, preserving reconstruction fidelity, providing a principled solution that enhances latent interpretability without degrading model performance. Source code is available at https://github.com/hr-jin/Cross-sample-Consistency-Regularization.
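
The idea of penalizing the co-activation of directionally similar latents can be illustrated with a minimal NumPy sketch. This is not the authors' loss: the function name `coactivation_penalty`, the cosine-similarity cutoff `sim_threshold`, and the product-of-activations form are all assumptions made for illustration, and the actual formulation is the one in the paper and the linked repository.

```python
import numpy as np

def coactivation_penalty(acts, decoder, sim_threshold=0.5):
    """Hypothetical batch-level penalty (illustrative, not the paper's loss).

    acts:    (batch, n_latents) non-negative SAE latent activations
    decoder: (n_latents, d_model) decoder directions, one row per latent
    """
    # Cosine similarity between every pair of decoder directions.
    norms = np.linalg.norm(decoder, axis=1, keepdims=True)
    unit = decoder / np.maximum(norms, 1e-12)
    sim = unit @ unit.T

    # Assumption: only distinct pairs whose similarity exceeds the cutoff
    # count as "directionally similar"; other pairs get zero weight.
    weight = np.where(sim > sim_threshold, sim, 0.0)
    np.fill_diagonal(weight, 0.0)

    # Co-activation aggregated over the batch: mean over samples of a_i * a_j.
    coact = acts.T @ acts / acts.shape[0]

    # Each unordered pair appears twice in the symmetric matrices, hence 0.5.
    return 0.5 * float(np.sum(weight * coact))

# Toy dictionary: latents 0 and 1 share a direction, latent 2 is orthogonal.
decoder = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# "Split" batch: the same concept fires latents 0 and 1 together.
acts_split = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
# "Unified" batch: the concept is carried by latent 0 alone.
acts_unified = np.array([[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]])

penalty_split = coactivation_penalty(acts_split, decoder)
penalty_unified = coactivation_penalty(acts_unified, decoder)
```

In this toy setup the split batch is penalized while the unified batch is not, even though both reconstruct the same vectors through the decoder. In training, a term of this kind would presumably be added to the usual reconstruction and sparsity objectives with a weighting coefficient, which is again an assumption rather than something stated in the abstract.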
