Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

2026-07-01 • Artificial Intelligence

Artificial IntelligenceComputation and LanguageMachine LearningLogic in Computer ScienceSoftware Engineering

AI summaryⓘ

The authors introduce Theoria, a system that checks AI answers by breaking them down into clear, auditable steps with explicit reasons for each change. This method helps catch hidden assumptions or fake information that other AI judges might miss. When tested on expert problems and tricky, deceptive cases, Theoria showed higher accuracy and transparency compared to holistic AI judges. It produces a readable proof trail, allowing anyone to verify or challenge each part of the answer.

AI verificationformal proof assistantLLM judgetyped state transitionsauditable proofhidden premisesprecisionWilson confidence intervaladversarial proofsproof trace

Authors

Ben Slivinski, Michael Saldivar

Abstract

When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification, whether that be a citation, computation, or problem-given fact, and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p= 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n= 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).

View PDFOpen arXiv