BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
2026-04-03 • Computation and Language
AI summary
The authors discuss how large language models often give answers with too much confidence, even when it would be better to say "I don't know." They introduce a new way to measure how well a model's confidence helps decide when to answer or abstain, called the Behavioral Alignment Score (BAS). This score rewards models for being truthful about their confidence, especially by avoiding mistakes made with high confidence. They also show that usual measures miss some problems that BAS catches, and that simple fixes can improve confidence accuracy. Their work offers a better tool and benchmark for checking how reliable these model confidences really are.
large language models, confidence calibration, decision-theoretic metric, Behavioral Alignment Score, abstention, overconfidence, expected utility, proper scoring rules, post-hoc calibration, evaluation metrics
Authors
Sean Wu, Fredrik K. Gustafsson, Edward Phillips, Boyan Gao, Anshul Thakur, David A. Clifton
Abstract
Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-$k$ confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.
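To make the answer-or-abstain utility model concrete, here is a minimal sketch of one plausible instantiation of a BAS-style score, not the authors' exact formula. It assumes that at risk threshold $\tau$ a correct answer earns $+1$, a wrong answer earns $-\tau/(1-\tau)$, and abstaining earns $0$, so that answering exactly when confidence $\geq \tau$ is expected-utility optimal under truthful confidence; realized utility is then averaged over a grid of thresholds. The function name `bas_sketch` and the specific utility values are illustrative assumptions.

```python
def bas_sketch(confidences, correct, n_thresholds=99):
    """Hypothetical BAS-style score: mean realized utility across
    risk thresholds tau, answering whenever confidence >= tau.

    Utility model (an assumption, not the paper's exact one):
      correct answer -> +1, wrong answer -> -tau/(1 - tau), abstain -> 0.
    With this choice, answering iff p >= tau maximizes expected utility
    for a truthful confidence p, mirroring the paper's optimality claim.
    """
    total = 0.0
    for i in range(1, n_thresholds + 1):
        tau = i / (n_thresholds + 1)      # thresholds strictly inside (0, 1)
        penalty = tau / (1.0 - tau)       # wrong-answer cost grows as tau -> 1
        utility = 0.0
        for conf, is_correct in zip(confidences, correct):
            if conf >= tau:               # answer only when confident enough
                utility += 1.0 if is_correct else -penalty
            # otherwise abstain: utility contribution is 0
        total += utility / len(confidences)
    return total / n_thresholds
```

Note how the asymmetric penalty described in the abstract emerges: a single wrong answer given with confidence near 1 incurs the large `-tau/(1 - tau)` penalty at almost every threshold, so highly overconfident errors dominate the score, whereas a wrong answer given with low confidence is abstained away at most thresholds.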