Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

2026-05-27
Computation and Language

AI summary

The authors studied whether large language models (LLMs) use words like "likely" or "maybe" in a way that reflects how confident they actually are in their answers. They found that LLMs do not consistently tie these confidence words to distinct levels of certainty, even though the relative ordering of the words by confidence stays somewhat consistent across tasks. As a result, the confidence words a model uses are not reliable indicators of how sure it actually is. The authors suggest that making these expressions more stable and better aligned could improve trust in LLMs.

Large Language Models, Epistemic Markers, Confidence Calibration, Intrinsic Uncertainty, Marker Internal Confidence, Model Calibration, Natural Language Processing, Trustworthiness, Model Reliability
Authors
Gabrielle Kaili-May Liu, Arman Cohan
Abstract
LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing _marker internal confidence_ (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.
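To make the MIC idea concrete, here is a minimal sketch of one possible reading of it. The abstract does not say how MICs are estimated or what the 7 metrics are, so everything below is an assumption for illustration: MIC is taken as the mean of per-response intrinsic confidence scores grouped by marker within one task domain, and cross-task ranking consistency is measured with a Spearman rank correlation. The function names (`marker_internal_confidence`, `rank_agreement`) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def marker_internal_confidence(records):
    """Assumed estimator, not the paper's: mean confidence per marker.

    `records` is an iterable of (marker, confidence) pairs collected from
    one task domain, with confidence a number in [0, 1].
    """
    by_marker = defaultdict(list)
    for marker, confidence in records:
        by_marker[marker].append(confidence)
    return {marker: mean(values) for marker, values in by_marker.items()}


def _average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def rank_agreement(mic_a, mic_b):
    """Spearman correlation between two MIC tables over their shared markers.

    An illustrative stand-in for a cross-task consistency metric.
    """
    shared = sorted(set(mic_a) & set(mic_b))
    rx = _average_ranks([mic_a[m] for m in shared])
    ry = _average_ranks([mic_b[m] for m in shared])
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("need at least two shared markers with distinct MICs")
    return cov / (var_x * var_y) ** 0.5
```

Under this reading, two tasks can show perfect rank agreement while the MIC values themselves differ widely, for example a marker mapping to 0.85 in one domain and 0.6 in another. That is the kind of pattern the abstract describes: a somewhat consistent ordering without stable, well-separated confidence levels.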