Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

2026-02-24 · Machine Learning

AI summary

The authors address the problem that standard Bayesian deep learning methods summarize epistemic uncertainty with a single number, mutual information (MI), which cannot indicate whether the uncertainty concerns a harmless or a safety-critical class. They decompose MI into a per-class vector that localizes uncertainty to individual classes, especially rare or critical ones. They validate this new measure on tasks such as disease detection and out-of-distribution identification, showing that it improves performance and reveals asymmetric uncertainty that MI misses. Their work also shows that how uncertainty is modeled and propagated through the network matters as much as how it is measured.

Bayesian deep learning, epistemic uncertainty, mutual information, selective prediction, out-of-distribution detection, aleatoric noise, posterior approximation, entropy, Taylor expansion, safety-critical classification
Authors
Mame Diarra Toure, David A. Stephens
Abstract
In safety-critical classification, the cost of failure is often asymmetric, yet Bayesian deep learning summarises epistemic uncertainty with a single scalar, mutual information (MI), that cannot distinguish whether a model's ignorance involves a benign or safety-critical class. We decompose MI into a per-class vector $C_k(x)=\sigma_k^{2}/(2\mu_k)$, with $\mu_k=\mathbb{E}[p_k]$ and $\sigma_k^2=\mathrm{Var}[p_k]$ across posterior samples. The decomposition follows from a second-order Taylor expansion of the entropy; the $1/\mu_k$ weighting corrects boundary suppression and makes $C_k$ comparable across rare and common classes. By construction $\sum_k C_k \approx \mathrm{MI}$, and a companion skewness diagnostic flags inputs where the approximation degrades. After characterising the axiomatic properties of $C_k$, we validate it on three tasks: (i) selective prediction for diabetic retinopathy, where critical-class $C_k$ reduces selective risk by 34.7\% over MI and 56.2\% over variance baselines; (ii) out-of-distribution detection on clinical and image benchmarks, where $\sum_k C_k$ achieves the highest AUROC and the per-class view exposes asymmetric shifts invisible to MI; and (iii) a controlled label-noise study in which $\sum_k C_k$ shows less sensitivity to injected aleatoric noise than MI under end-to-end Bayesian training, while both metrics degrade under transfer learning. Across all tasks, the quality of the posterior approximation shapes uncertainty at least as strongly as the choice of metric, suggesting that how uncertainty is propagated through the network matters as much as how it is measured.
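As a rough illustration (not the authors' code), the per-class decomposition can be computed directly from Monte Carlo posterior samples of the softmax output: since the second derivative of $-p\log p$ is $-1/p$, a second-order Taylor expansion of the entropy around $\mu$ gives $\mathrm{MI} \approx \sum_k \sigma_k^2/(2\mu_k)$. The sampling scheme, mean logits, and noise scale below are hypothetical stand-ins for a real Bayesian posterior (e.g. MC dropout or a deep ensemble):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: S samples of a K-class softmax output,
# generated here by perturbing fixed mean logits with small Gaussian noise.
S, K = 2000, 4
mean_logits = np.array([2.0, 0.5, 0.0, -1.0])
logits = mean_logits + 0.2 * rng.normal(size=(S, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

mu = probs.mean(axis=0)   # E[p_k] across posterior samples
var = probs.var(axis=0)   # Var[p_k] across posterior samples

# Per-class epistemic contribution: C_k = Var[p_k] / (2 E[p_k])
C = var / (2.0 * mu)

def entropy(p, axis=-1):
    """Shannon entropy in nats along the given axis."""
    return -(p * np.log(p)).sum(axis=axis)

# Exact mutual information: MI = H(E[p]) - E[H(p)]
mi = entropy(mu) - entropy(probs, axis=1).mean()

print("per-class C_k:", C)
print("sum_k C_k:", C.sum(), " MI:", mi)  # the two should be close
```

Unlike the raw per-class variance $\sigma_k^2$, the $1/\mu_k$ weighting keeps contributions from rare classes (small $\mu_k$) from being suppressed near the probability-simplex boundary, which is why the per-class view can flag uncertainty about a critical minority class that scalar MI averages away.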