Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

2026-06-30 • Computation and Language

Computation and Language, Artificial Intelligence

AI summary

The authors aim to improve metacognition in large language models (LLMs): the ability to monitor and judge their own performance. Current models struggle here, often answering incorrectly with high confidence and failing to recognize the boundaries of their knowledge. The authors introduce two methods: reinforcement learning with metacognitive feedback (RLMF), which refines the ranking of completions during preference optimization according to how well the model judges its own performance, and metacognitive data selection, which uses similar self-judgments to pick high-value training examples. They apply both to faithful calibration, which means aligning the confidence a model expresses with its intrinsic uncertainty, and then map the calibrated confidence scores to natural-language expressions of uncertainty. In their experiments, RLMF outperforms standard reinforcement learning by up to 63% while preserving accuracy, and it improves models' ability to recognize their own limits. The authors suggest that metacognitive performance can serve as a feedback signal for improving reliability and alignment.

Metacognition, Large Language Models (LLMs), Reinforcement Learning, Preference Optimization, Calibration, Uncertainty Estimation, Active Learning, Self-assessment, Faithful Calibration, Intrinsic Feedback

Authors

Gabrielle Kaili-May Liu, Avi Caciularu, Gal Yona, Idan Szpektor, Arman Cohan

Abstract

Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty, undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model's self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed uncertainty with intrinsic uncertainty, which is difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models' self-reported confidence scores, then mapping these scores to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.
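
To make the idea of refining completion rankings with self-judgments concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' algorithm: the abstract does not give the scoring rule, so the `Completion` fields, the `metacognitive_quality` score (one minus the gap between self-reported confidence and the actual outcome), and the tie-breaking order are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Completion:
    text: str
    correct: bool      # task outcome; assumes some answer checker exists
    confidence: float  # self-reported probability of being correct, in [0, 1]


def metacognitive_quality(c: Completion) -> float:
    """Hypothetical score: 1 minus the gap between self-judgment and outcome."""
    return 1.0 - abs(c.confidence - float(c.correct))


def ranking_key(c: Completion):
    # Assumed ordering: task outcome first, self-judgment quality second.
    return (c.correct, metacognitive_quality(c))


def rank_completions(completions):
    """Return completions sorted best-first under the assumed ordering."""
    return sorted(completions, key=ranking_key, reverse=True)


def preference_pairs(completions):
    """Build (chosen, rejected) text pairs, skipping exact ties."""
    ranked = rank_completions(completions)
    return [
        (a.text, b.text)
        for a, b in combinations(ranked, 2)
        if ranking_key(a) != ranking_key(b)
    ]


samples = [
    Completion("D", correct=False, confidence=0.95),  # wrong and overconfident
    Completion("B", correct=True, confidence=0.5),    # right but unsure
    Completion("C", correct=False, confidence=0.2),   # wrong and aware of it
    Completion("A", correct=True, confidence=0.9),    # right and confident
]
pairs = preference_pairs(samples)
```

In this toy setup, an outcome-only reward would treat "A" and "B" as equally good and "C" and "D" as equally bad. The assumed self-judgment score separates them, so the resulting pairs also prefer the completion whose stated confidence better matches what actually happened.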
