The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

2026-06-24Artificial Intelligence

Artificial IntelligenceCryptography and SecurityMachine Learning
AI summary

The authors explain that AI systems controlling tools and infrastructure need stronger safety checks built outside the AI itself to prevent unwanted actions. They identify four key properties for such controls, aiming to stop the AI from bypassing rules by messing with its own processes. They created a Rust-based safety kernel that enforces these rules strictly, proving its reliability with formal verification methods. Testing showed the kernel successfully prevented any harmful self-changes or escapes by the AI during many trials, even when attacked by adversaries. This work adds a layer of safety that works during AI execution, complementing training and output controls.

AI agentruntime controlsauthorization mechanismprocess separationfail-closedexecution-time alignmentformal verificationSMT theoremmodel checkingsafety kernel
Authors
Seth Dobrin, Łukasz Chmiel
Abstract
AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent's own runtime: system prompts, output filters, and guardrail libraries. Any control in the agent's address space is reachable by inputs that influence it; this generalizes to any AI system with sufficient reach into its own runtime, a class we term escapable AI systems. We identify four properties that an authorization mechanism must satisfy for architectural control rather than for cooperative requests: process separation, pre-action enforcement on a structurally only path, fail-closed at both the request and system levels, and externalized signed evidence verifiable outside the controlled system's trust boundary. We position this layer as execution-time AI alignment, complementing training-time alignment (RLHF, Constitutional AI) and inference-time alignment. We present the Unfireable Safety Kernel, a Rust reference implementation realizing all four. Its fail-closed invariant is machine-checked at two levels: an SMT theorem (Z3) and an exhaustive bounded-model-checking proof of the production decision function (Kani, 4/4 harnesses). A Python-to-Rust migration was gated on byte-equivalence (1000/1000 fixtures; 17/17 adversarial classes). We evaluate the kernel governing a live, escapable AI system, a deterministic, self-improving world model, against an escape-seeking adversary driving its real self-modification seam: across 1,000 self-modifications, all 704 attempts on the safety-critical core are refused, with no escape; a further 300, under the operator kill switch, are also refused. A separate campaign of 6,240 authorization round-trips had no successful bypass. Against 3 contemporary systems claiming the agent control plane, the agent invokes control; here, it lacks that choice.