AI summary
The authors studied how reducing the numerical precision of a large language model (LLM) changes its ability to judge its own knowledge across different topics. They found that quantisation does not degrade metacognitive efficiency uniformly but reshuffles it across domains: Arts & Literature is the worst-monitored domain in the quantised model (Q5_K_M) but the best-monitored at f16 precision, while Geography goes from well-monitored to under-monitored. However, the model's ability to distinguish correct from incorrect answers (Type-2 AUROC) kept the same domain ordering in both formats, suggesting that the change arises from how M-ratio is normalised rather than from the underlying discrimination signal. Targeted training aimed at the weak domain did not improve metacognition, because the diagnostic profile did not carry over between formats. The authors conclude that domain-level M-ratio profiles depend on inference format, recommend AUROC_2 as the more robust metric, and release their code, pre-registrations, and trial-level data.
model quantisation; metacognitive efficiency; LLM (Large Language Model); M-ratio; Type-2 AUROC; inference format; confidence-amplification; domain-conditional training; metacognition; precision formats (Q5_K_M, f16)
Abstract
We report that model quantisation restructures domain-level metacognitive efficiency in LLMs rather than degrading it uniformly. Evaluating Llama-3-8B-Instruct on the same 3,000 questions at Q5_K_M and f16 precision, we find that M-ratio profiles across four knowledge domains are uncorrelated between formats (Spearman rho = 0.00). Arts & Literature moves from worst-monitored (M-ratio = 0.606 at Q5_K_M) to best-monitored (1.542 at f16). Geography moves from well-monitored (1.210) to under-monitored (0.798). However, Type-2 AUROC profiles are perfectly stable across formats (rho = 1.00), localising the restructuring to the M-ratio normalisation rather than the underlying discrimination signal. This finding emerged from a pre-registered attempt to improve metacognition through domain-conditional training. We prescribed confidence-amplification SFT for the diagnosed weak domain, with matched-budget agnostic and wrong-prescription controls. All four confirmatory hypotheses were null (10,000 bootstrap resamples, seed = 42). The training successfully reshaped confidence distributions, doubling the NLP gap in Science from 0.076 to 0.152, but did not improve meta-d' because the diagnostic profile did not transfer across formats. Any system relying on domain-level M-ratio profiles has an unexamined dependency on inference format. Systems using AUROC_2 are safer. We release all code, pre-registrations, and trial-level data.
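To make the two metrics contrasted above concrete, the following is a minimal sketch and not the released analysis code. It assumes trial-level data in the form of one confidence value and one correctness flag per question. Type-2 AUROC is computed as the probability that a correct trial carries higher confidence than an incorrect one, and M-ratio is taken as meta-d' divided by d' (the meta-d' fit itself is not shown). The function names, the percentile bootstrap, and the toy data are illustrative assumptions; only the resample count and seed mirror the values stated in the abstract.

```python
import random


def type2_auroc(confidence, correct):
    """P(confidence on a correct trial > confidence on an incorrect trial).

    Ties count as 0.5. Depends only on the rank order of confidence.
    """
    hits = [c for c, ok in zip(confidence, correct) if ok]
    misses = [c for c, ok in zip(confidence, correct) if not ok]
    if not hits or not misses:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = 0.0
    for h in hits:
        for m in misses:
            if h > m:
                wins += 1.0
            elif h == m:
                wins += 0.5
    return wins / (len(hits) * len(misses))


def m_ratio(meta_d, d_prime):
    """M-ratio = meta-d' / d'; the division by d' is the normalisation step."""
    return meta_d / d_prime


def bootstrap_ci(confidence, correct, n_resamples=10_000, seed=42, alpha=0.05):
    """Percentile bootstrap interval for Type-2 AUROC (illustrative choice)."""
    rng = random.Random(seed)
    n = len(confidence)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        conf_b = [confidence[i] for i in idx]
        corr_b = [correct[i] for i in idx]
        if all(corr_b) or not any(corr_b):
            continue  # skip resamples with only one outcome class
        stats.append(type2_auroc(conf_b, corr_b))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical toy trials, NOT data from the study.
toy_confidence = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_correct = [1, 1, 0, 1, 0, 1, 0, 0]

print(type2_auroc(toy_confidence, toy_correct))  # 0.8125
```

Because the AUROC in this sketch depends only on how confidence ranks correct against incorrect trials, it involves no division by d', whereas M-ratio does. That difference is the normalisation step the abstract refers to.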