AI summary
The authors discuss challenges in understanding how large language models (LLMs) work, especially when researchers draw conclusions that do not generalise or make causal claims without sufficient evidence. They suggest using ideas from causal inference to clearly define what experimental evidence is needed to link model components to behaviour in a reliable way. By applying Pearl's causal hierarchy, they explain what kinds of claims about models are justified, distinguishing between simple observations, interventions, and more complex counterfactual reasoning. Their work introduces a framework based on causal representation learning that helps researchers choose methods that make findings trustworthy and generalizable.
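To make the distinction between the hierarchy's levels concrete, here is a minimal sketch (not from the paper; the probe, the logged arrays, and the prompt count are hypothetical) of an observation-level claim: an association between an internal component and a behaviour, which by itself does not justify intervention or counterfactual statements.

```python
# Observation-level evidence: correlating a cached activation with a behaviour.
# All data here is synthetic, standing in for values logged from an LLM.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these were recorded while running an LLM on 200 prompts:
unit_activation = rng.normal(size=200)                    # one hidden unit, per prompt
behaviour = 0.8 * unit_activation + rng.normal(size=200)  # e.g., logit of a target token

# An association between component and behaviour (observation level only).
r = np.corrcoef(unit_activation, behaviour)[0, 1]
print(f"correlation (observation-level evidence): {r:.2f}")

# An intervention-level claim would instead require editing the activation
# (ablation/patching) and re-running the model, as in the paper's framing
# of Pearl's hierarchy.
```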
Large Language Models, Interpretability, Causal Inference, Pearl's Causal Hierarchy, Interventions, Counterfactuals, Activation Patching, Ablation, Causal Representation Learning, Model Behavior
Authors
Shruti Joshi, Aaron Mueller, David Klindt, Wieland Brendel, Patrik Reizinger, Dhanya Sridhar
Abstract
Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims -- i.e., asking what the model output would have been for the same prompt under an unobserved intervention -- remain largely unverifiable without controlled supervision. We show how causal representation learning (CRL) operationalises this hierarchy, specifying which variables are recoverable from activations and under what assumptions. Together, these considerations motivate a diagnostic framework that helps practitioners select methods and evaluations that match claims to evidence, so that findings generalise.
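The following sketch illustrates the intervention level described above: a zero-ablation of one internal unit and an average effect on a behavioural metric over a set of prompts. It uses a toy two-layer model, a made-up prompt batch, and an arbitrary target token; it is an assumption-laden illustration of the general technique, not the paper's experimental setup.

```python
# Intervention-level evidence: ablate an internal component and measure the
# average change in a behavioural metric over a set of prompts.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, HIDDEN = 50, 16

class ToyLM(nn.Module):
    """Toy stand-in for an LLM: embedding -> MLP -> next-token distribution."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.mlp = nn.Linear(HIDDEN, HIDDEN)
        self.unembed = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):
        h = self.embed(tokens).mean(dim=1)   # crude pooling over the prompt
        h = torch.relu(self.mlp(h))          # the "activations" we intervene on
        return self.unembed(h).softmax(-1)   # next-token probabilities

model = ToyLM().eval()
prompts = torch.randint(0, VOCAB, (8, 5))    # 8 toy "prompts" of 5 tokens each
target_token = 3                             # behavioural metric: P(token 3)

def ablate_unit(module, inputs, output, unit=0):
    # Zero-ablation: set one unit of the MLP output to 0 (the intervention).
    patched = output.clone()
    patched[:, unit] = 0.0
    return patched

with torch.no_grad():
    clean = model(prompts)[:, target_token]

handle = model.mlp.register_forward_hook(ablate_unit)
with torch.no_grad():
    ablated = model(prompts)[:, target_token]
handle.remove()

# This supports a claim about the average effect of the edit over the prompt
# set, not a counterfactual claim about any single prompt.
print(f"avg change in P(target token): {(ablated - clean).mean().item():+.4f}")
```

A patching variant of the same sketch would overwrite the activation with one cached from a different prompt rather than zeroing it; either way, the evidence remains at the intervention rung of the hierarchy.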