Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

2026-05-07

Machine Learning, Artificial Intelligence
AI summary

The authors introduce a new way to explain how deep learning models make decisions by focusing on meaningful concepts instead of raw input features such as pixels. They combine ideas from two fields: one that explains models using high-level concepts, and another that finds the smallest set of input parts responsible for an outcome. Their method finds the minimal sets of concepts that are causally responsible for a model's prediction, and it can explain decisions for single images or for groups of images that share a common behavior. They test their approach on various models and datasets and show that it provides clear, useful explanations.
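To give a rough sense of what "erasing" a concept can mean, the sketch below removes a concept from an image embedding by projecting out a linear concept direction. This is a generic linear-erasure illustration, not the paper's actual erasure procedure; the array names `embedding` and `concept_direction` are assumptions made for the example.

```python
# Minimal sketch of linear concept erasure (illustrative, not the authors' code):
# remove a concept by projecting the embedding onto the subspace orthogonal
# to that concept's direction.
import numpy as np

def erase_concept(embedding: np.ndarray, concept_direction: np.ndarray) -> np.ndarray:
    """Return `embedding` with its component along `concept_direction` removed."""
    d = concept_direction / np.linalg.norm(concept_direction)
    return embedding - np.dot(embedding, d) * d

# Toy usage: after erasure, the embedding has no component along the concept.
rng = np.random.default_rng(0)
z = rng.normal(size=8)   # hypothetical image embedding
c = rng.normal(size=8)   # hypothetical linear concept direction
z_erased = erase_concept(z, c)
assert abs(np.dot(z_erased, c / np.linalg.norm(c))) < 1e-9
```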

concept-based explanations, deep neural networks, causal explanations, abductive explanations, contrastive explanations, concept erasure, model interpretability, minimal explanations, high-level concepts, behavior analysis
Authors
Ronaldo Canizales, Divya Gopinath, Corina Păsăreanu, Ravi Mangal
Abstract
*Concept-based explanations* offer a promising approach for explaining the predictions of deep neural networks in terms of high-level, human-understandable concepts. However, existing methods either do not establish a causal connection between the concepts and model predictions or are limited in expressivity and only able to infer causal explanations involving single concepts. At the same time, the parallel line of work on *formal abductive and contrastive explanations* computes the minimal set of input features causally relevant for model outcomes but only considers low-level features such as pixels. Merging these two threads, in this work, we propose the notion of *concept-based abductive and contrastive explanations* that capture the minimal sets of high-level concepts causally relevant for model outcomes. We then present a family of algorithms that enumerate all minimal explanations while using *concept erasure* procedures to establish causal relationships. By appropriately aggregating such explanations, we are not only able to understand model predictions on individual images but also on collections of images where the model exhibits a user-specified, common *behavior*. We evaluate our approach on multiple models, datasets, and behaviors, and demonstrate its effectiveness in computing helpful, user-friendly explanations.
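To make the notion of a minimal concept-level explanation concrete, here is a hedged sketch of a standard deletion-based search for one subset-minimal abductive explanation. The `prediction_unchanged` oracle and the concept names are hypothetical stand-ins for the paper's concept-erasure check, and the paper's algorithms go further by enumerating all minimal explanations rather than returning a single one.

```python
# Sketch: greedy deletion-based search for one subset-minimal set of concepts
# that is sufficient for the model's outcome. `prediction_unchanged(kept)` is
# a hypothetical oracle that should return True iff the prediction is preserved
# when every concept outside `kept` is erased.
from typing import Callable, FrozenSet, List

def minimal_abductive_explanation(
    concepts: List[str],
    prediction_unchanged: Callable[[FrozenSet[str]], bool],
) -> FrozenSet[str]:
    """Return a subset-minimal set of concepts sufficient for the prediction."""
    kept = set(concepts)
    for c in concepts:                      # linear, deletion-based search
        trial = frozenset(kept - {c})
        if prediction_unchanged(trial):     # concept c is not needed
            kept.discard(c)
    return frozenset(kept)

# Toy usage: suppose the prediction only depends on "stripes" and "four_legs".
needed = {"stripes", "four_legs"}
oracle = lambda kept: needed <= kept
print(minimal_abductive_explanation(
    ["stripes", "mane", "four_legs", "tail"], oracle))
# e.g. frozenset({'stripes', 'four_legs'})
```

In the formal-explanation literature, contrastive explanations are dual to abductive ones: they correspond to minimal sets of concepts whose erasure suffices to change the outcome, so a similar search applies with the oracle's role inverted.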