CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching
2026-02-23 • Artificial Intelligence
AI summary
The authors created a new test called CausalFlip to check whether large language models (LLMs) truly understand cause-and-effect relationships rather than just recognizing patterns in language. The test uses pairs of questions that look very similar but have opposite answers once you reason about the actual causes. It also adds a trick: extra unrelated text is inserted to see whether models get confused. Experiments showed that even methods designed to help reasoning can still be fooled by word patterns, but a new approach that makes models reason about causes internally worked better. This suggests there is real potential to help LLMs grasp cause and effect beyond pattern matching.
Large Language Models, Causality, Causal Reasoning, Confounder, Chain-of-Thought, Semantic Correlation, Benchmark, Collider, Internalized Reasoning, Spurious Correlations
Authors
Yuzhe Wang, Yaochen Zhu, Jundong Li
Abstract
As large language models (LLMs) see increasing deployment in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability in LLMs, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the true underlying causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that can form different confounder, chain, and collider relations. Based on this, for each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, so that models relying heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models' reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text before intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation in the reasoning process. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing reasoning steps yields substantially improved causal grounding, suggesting that it is promising to better elicit the latent causal reasoning capabilities of base LLMs.
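The abstract's core construction, event triples instantiated as chain, confounder, or collider graphs, where semantically similar questions flip their correct answer, can be illustrated with a minimal sketch. This is not the authors' code; the structure names and the reachability check are illustrative assumptions based only on the abstract's description.

```python
# Hedged sketch (not the CausalFlip implementation): one event triple
# (A, B, C) can instantiate three causal structures, and the same surface
# question "Does A cause C?" then has opposite ground-truth answers.

# Edges are (cause, effect) pairs over the abstract three events.
STRUCTURES = {
    "chain":      [("A", "B"), ("B", "C")],   # A -> B -> C
    "confounder": [("B", "A"), ("B", "C")],   # A <- B -> C
    "collider":   [("A", "B"), ("C", "B")],   # A -> B <- C
}

def causes(structure, x, y):
    """True iff x is a direct or indirect cause of y in the given graph."""
    edges = STRUCTURES[structure]
    frontier, reachable = [x], set()
    while frontier:
        node = frontier.pop()
        for u, v in edges:
            if u == node and v not in reachable:
                reachable.add(v)
                frontier.append(v)
    return y in reachable

# Identical events, near-identical wording, opposite causal answers:
print(causes("chain", "A", "C"))       # True:  A -> B -> C
print(causes("collider", "A", "C"))    # False: A and C independently cause B
print(causes("confounder", "A", "C"))  # False: B confounds A and C
```

A model that matches on the shared event wording rather than the graph structure would answer both variants the same way, which is exactly the failure mode the paired questions are designed to expose.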