Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification

2026-02-27

Machine Learning, Artificial Intelligence
AI summary

The authors studied how to simplify neural networks while keeping their core cause-and-effect behavior the same. They treated the network like a clear set of cause-effect rules and created a math formula to decide which parts of the network can be removed or merged without changing these rules. Their method improves upon older ways that just looked at activity levels, making it easier to find simpler models that still behave like the original. They tested their approach by checking if the simplified models respond correctly when parts are changed.

Neural networks, Causal abstraction, Structural Causal Model, Interventions, Structured pruning, Activation variance, Interventional Risk, Deterministic SCM, Model simplification, Interchange interventions
Authors
Amir Asiaee
Abstract
Neural networks are hypothesized to implement interpretable causal mechanisms, yet verifying this requires finding a causal abstraction -- a simpler, high-level Structural Causal Model (SCM) faithful to the network under interventions. Discovering such abstractions is hard: it typically demands brute-force interchange interventions or retraining. We reframe the problem by viewing structured pruning as a search over approximate abstractions. Treating a trained network as a deterministic SCM, we derive an Interventional Risk objective whose second-order expansion yields closed-form criteria for replacing units with constants or folding them into neighbors. Under uniform curvature, our score reduces to activation variance, recovering variance-based pruning as a special case while clarifying when it fails. The resulting procedure efficiently extracts sparse, intervention-faithful abstractions from pretrained networks, which we validate via interchange interventions.
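The abstract's key claim — that under uniform curvature the second-order interventional-risk score reduces to activation variance — can be illustrated with a minimal sketch. The function below is hypothetical (the paper's actual objective and curvature estimates are not given here); it only shows the claimed structure: each unit's score is a curvature term times its activation variance, and setting all curvatures equal recovers plain variance-based ranking of units to replace with constants.

```python
import numpy as np

def prune_scores(activations, curvatures=None):
    """Hypothetical sketch of the second-order pruning criterion.

    Scores each hidden unit for replacement by a constant (its mean
    activation): score_j ~= c_j * Var(a_j), where c_j is a curvature
    term from the second-order expansion of the interventional risk.
    With uniform curvature (c_j = 1 for all j), this reduces to the
    classic activation-variance criterion.
    """
    var = activations.var(axis=0)       # per-unit activation variance
    if curvatures is None:
        curvatures = np.ones_like(var)  # uniform-curvature special case
    return curvatures * var

# Toy usage: 100 samples, 4 units; unit 0 is nearly constant,
# so it should be the cheapest to replace with a constant.
rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 4)) * np.array([0.01, 1.0, 2.0, 0.5])
scores = prune_scores(acts)
order = np.argsort(scores)  # lowest score = best pruning candidate
```

Non-uniform curvatures are where, per the abstract, variance-based pruning fails: a low-variance unit sitting in a high-curvature region of the downstream mechanism can still carry large interventional risk.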