Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
2026-04-10 • Computation and Language
Computation and Language • Artificial Intelligence • Machine Learning
AI summary
The authors created a method called SNCA to check whether large language models (LLMs) actually follow the safety rules they claim to follow. They first prompt each model to state its own safety rules, convert those statements into typed logical predicates, and then test whether the model's behavior matches them. Across four frontier models and 45 harm categories, models that claimed to refuse harmful requests absolutely frequently complied with harmful prompts anyway. The models also differed widely in how well they could state and enforce their rules: reasoning models were the most self-consistent but could not articulate a policy for 29% of categories, and the models agreed on rule types in only 11% of cases. What a model says about its safety policy is therefore not a reliable guide to what it does, and this gap can be measured directly.
Large Language Models • Reinforcement Learning from Human Feedback (RLHF) • Safety Policies • Behavioral Benchmarks • Symbolic-Neural Consistency Audit (SNCA) • Typed Predicates • Harm Categories • Model Compliance • Self-stated Rules • Consistency Audits
Authors
Avni Mittal
Abstract
LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
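The comparison step described in the abstract can be illustrated with a minimal sketch. The abstract confirms only the three rule types (Absolute, Conditional, Adaptive) and the idea of a deterministic comparison; everything else here is an assumption made for illustration, including the names `Rule`, `Observation`, `absolute_violation_rate`, and `rule_type_agreement`, the reading of an Absolute rule as "always refuse", the definition of agreement as all models assigning the same type to a shared category, and the category labels in the test data. It is not the paper's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class RuleType(Enum):
    # The three predicate types named in the abstract.
    ABSOLUTE = "absolute"
    CONDITIONAL = "conditional"
    ADAPTIVE = "adaptive"


@dataclass(frozen=True)
class Rule:
    # Hypothetical record of one self-stated rule for one harm category.
    category: str
    rule_type: RuleType


@dataclass(frozen=True)
class Observation:
    # Hypothetical record of one benchmark prompt and whether the model refused.
    category: str
    refused: bool


def absolute_violation_rate(rules, observations):
    """Share of observations in ABSOLUTE categories where the model did not refuse.

    Assumes an Absolute rule means "always refuse". Returns None when no
    observation falls in an ABSOLUTE category.
    """
    absolute = {r.category for r in rules if r.rule_type is RuleType.ABSOLUTE}
    relevant = [o for o in observations if o.category in absolute]
    if not relevant:
        return None
    return sum(not o.refused for o in relevant) / len(relevant)


def rule_type_agreement(policies):
    """Share of shared categories where every model states the same rule type.

    `policies` maps a model name to a {category: RuleType} dict. This
    definition of agreement is an assumption, not taken from the paper.
    """
    shared = set.intersection(*(set(p) for p in policies.values()))
    if not shared:
        return None
    agree = sum(len({p[c] for p in policies.values()}) == 1 for c in shared)
    return agree / len(shared)


# Illustrative data only; the category names are made up.
rules = [Rule("weapons", RuleType.ABSOLUTE), Rule("medical", RuleType.CONDITIONAL)]
observations = [
    Observation("weapons", refused=True),
    Observation("weapons", refused=False),
    Observation("medical", refused=False),
]
policies = {
    "model_a": {"weapons": RuleType.ABSOLUTE, "medical": RuleType.CONDITIONAL},
    "model_b": {"weapons": RuleType.ABSOLUTE, "medical": RuleType.ADAPTIVE},
}

violation_rate = absolute_violation_rate(rules, observations)
agreement = rule_type_agreement(policies)
```

Because both functions are pure set and count operations over recorded labels, the comparison is deterministic once the refusal labels are fixed; how those labels are obtained from model outputs is outside this sketch.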