Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

2026-06-03 · Computation and Language

Computation and Language, Information Retrieval
AI summary

The authors studied whether large language models (LLMs) really understand cause-and-effect or just recognize word patterns. They created a test called Caliper that hides real words with placeholders but keeps the question's logic the same. When tested this way, all models showed big drops in accuracy, meaning they rely a lot on word clues instead of true causal reasoning. Techniques like giving examples can help a little, but overall, the authors found that current LLMs don’t demonstrate strong structural causal reasoning without lexical hints.

large language models, causal reasoning, Caliper, lexical anonymization, instruction tuning, zero-shot learning, few-shot learning, causal graph, probabilistic specification, structural reasoning
Authors
Zhenyu Yu, Shuigeng Zhou
Abstract
Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 (original-wording) accuracy on smaller models rather than recovering P1 (anonymized) accuracy. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
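The perturbation and the reported gap can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `anonymize` and `accuracy_gap_pp`, the placeholder format `X1`, `X2`, ..., and the example question are all assumptions made for illustration. The sketch also assumes that the gap is the original-wording accuracy minus the anonymized accuracy, expressed in percentage points.

```python
import re


def anonymize(question: str, variables: list[str]) -> tuple[str, dict[str, str]]:
    """Replace semantic variable names with placeholders (hypothetical X1, X2, ...).

    Only the surface names change; the sentence structure, and therefore the
    causal relations stated in the question, are left untouched.
    """
    if not variables:
        return question, {}
    mapping = {name: f"X{i}" for i, name in enumerate(variables, start=1)}
    lookup = {name.lower(): token for name, token in mapping.items()}
    # Longest names first, so a multi-word name wins over any shorter name
    # that happens to be contained in it.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        flags=re.IGNORECASE,
    )
    # A single pass avoids rewriting placeholders that were already inserted.
    anonymized = pattern.sub(lambda m: lookup[m.group(0).lower()], question)
    return anonymized, mapping


def accuracy_gap_pp(correct_original: list[bool], correct_anonymized: list[bool]) -> float:
    """Accuracy on original questions minus accuracy on anonymized ones, in pp."""
    acc_original = sum(correct_original) / len(correct_original)
    acc_anonymized = sum(correct_anonymized) / len(correct_anonymized)
    return 100.0 * (acc_original - acc_anonymized)


if __name__ == "__main__":
    # Made-up example question, not taken from any benchmark.
    q = ("Smoking causes tar deposits, and tar deposits cause lung cancer. "
         "Does smoking increase lung cancer?")
    anon, mapping = anonymize(q, ["smoking", "tar deposits", "lung cancer"])
    print(anon)
    print(mapping)
```

Under these assumptions, a positive value from `accuracy_gap_pp` means the model answered more questions correctly when the original variable names were visible, which is the direction of the effect the abstract describes.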