Tracking Equivalent Mechanistic Interpretations Across Neural Networks

2026-03-31

Subjects: Machine Learning; Computation and Language
AI summary

The authors explore a new way to determine whether different neural networks use the same underlying reasoning process, even without knowing exactly what that process is. They introduce the idea of 'interpretive equivalence': two models share an interpretation if every possible implementation of that interpretation behaves the same. They develop an algorithm to check this and evaluate it on Transformer models. Their work connects a model's decision process, its internal circuits, and its data representations, building a foundation for more rigorous ways to interpret neural networks.

Keywords: mechanistic interpretability, interpretive equivalence, Transformer models, neural networks, model representations, algorithmic interpretations, circuit analysis, interpretation evaluation
Authors
Alan Sun, Mariya Toneva
Abstract
Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and a model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation, and generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and present a case study of its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models' representation similarity. We provide guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods for MI and automated, generalizable interpretation discovery methods.
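The abstract ties interpretive equivalence to conditions on models' representation similarity. The paper's actual algorithm is not reproduced here; as an illustrative sketch only, one standard way to compare two models' representations over the same inputs is linear Centered Kernel Alignment (CKA). The `representations_similar` helper and its threshold are hypothetical assumptions for illustration, not the authors' criterion.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n x d1) and Y (n x d2),
    where each row corresponds to the same one of n inputs."""
    X = X - X.mean(axis=0)  # center features
    Y = Y - Y.mean(axis=0)
    # HSIC-style cross- and self-similarity terms (Frobenius norms)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def representations_similar(X, Y, threshold=0.95):
    """Hypothetical proxy: treat high CKA as evidence that two models'
    representations could support a shared interpretation."""
    return linear_cka(X, Y) >= threshold
```

Linear CKA is invariant to orthogonal transformations of either representation, so two models whose activations differ only by a rotation score 1.0; this invariance is one reason representation-similarity measures are natural candidates for conditions like those the abstract describes.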