The First Token Knows: Single-Decode Confidence for Hallucination Detection

2026-05-06

Computation and Language, Artificial Intelligence
AI summary

The authors studied methods for detecting when a language model makes mistakes or ‘hallucinates’ answers. They found that simply measuring how confident the model is when choosing the very first word of its answer (a score called phi_first) works as well as or better than more complex methods that compare many different answers. This approach needs only one answer, making it faster and cheaper. The results suggest that much of the uncertainty information is already present in the model’s initial word choice, so phi_first can serve as a simple baseline for detecting unreliable answers.

hallucination detection, self-consistency, language models, natural language inference, logits, entropy, AUROC, greedy decoding, instruction-tuned models
Authors
Mina Gabriel
Abstract
Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model's initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.
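As a concrete illustration of the measurement the abstract describes, the minimal sketch below computes a first-token confidence score from the normalized entropy of the top-K next-token logits. It assumes a Hugging Face causal LM; the function name, the choice K=10, and the simplification of scoring the first generated token directly (rather than locating the first content-bearing token, as the paper does) are illustrative assumptions, not the authors' exact procedure.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def first_token_confidence(model, tokenizer, prompt: str, top_k: int = 10) -> float:
    """Sketch of a phi_first-style score: 1 minus the entropy of the
    renormalized top-K next-token distribution, divided by its maximum
    possible value log(K). Higher = more peaked = more confident.
    Simplification: uses the first generated token; the paper skips
    non-content tokens to find the first content-bearing answer token.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    next_logits = out.logits[0, -1]                    # logits for the first answer token
    top_vals = torch.topk(next_logits, top_k).values   # keep only the top-K logits
    probs = torch.softmax(top_vals, dim=-1)            # renormalize over the top-K
    entropy = -(probs * probs.log()).sum()
    normalized = entropy / torch.log(torch.tensor(float(top_k)))
    return float(1.0 - normalized)                     # 1.0 = fully confident

# Usage (model name is a placeholder, not one of the paper's models):
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# score = first_token_confidence(model, tokenizer, "Q: Who wrote Hamlet?\nA:")

Under the paper's hypothesis, a score near 1 indicates a sharply peaked first-token distribution, while low scores should flag answers more likely to be hallucinated; no sampling or external NLI model is required.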