Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

2026-05-11 · Machine Learning

AI summary

The authors study Guardrail Classifiers, which are designed to stop language models from producing harmful outputs but usually lack formal safety guarantees. They propose a new way to verify safety by examining the model's internal representations rather than the raw input text, defining harmful regions as convex shapes that enclose known harmful examples. Using this method, they tested three classifiers and found real safety gaps despite good test results. They also show differences in how models like GPT-2, Llama, and BERT represent harmful content, with BERT having notably weaker safety margins. Their work gives a clearer, more formal picture of how reliable these safety classifiers really are.

Guardrail Classifiers, language models, formal verification, pre-activation space, harmful behavior, convex region, sigmoid classification, SVD, Gaussian Mixture Models, toxicity detection
Authors
Nikita Kezins, Urbas Ekka, Pascal Berrang, Luca Arnaboldi
Abstract
Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing such guarantees is hard because "harmful behavior" has no natural specification in a discrete input space, and the standard epsilon-ball properties used in other domains carry no semantic meaning there. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof, without approximation, in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers in the toxicity domain, we find that every hyper-rectangle configuration returns SAT, exposing verifiable safety holes in all classifiers despite seemingly high empirical metrics. Probabilistic GMM certificates further expose divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT's safety guarantees prove uniquely volatile: a 'coverage collapse' to 55% at the optimal threshold reveals a sparsely populated safety margin, and BERT achieves full coverage only by adopting an extremely pessimistic threshold. Combined, these approaches provide new insight into how effective Guardrail Classifiers really are, beyond traditional red-teaming.
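
To make the exact certificate concrete, here is a minimal sketch of the core argument, assuming a guardrail head of the form sigmoid(w · z + b) over pre-activations z and an SVD-aligned hyper-rectangle around known harmful representations. The function names and interfaces are illustrative, not the authors' code.

```python
import numpy as np

def svd_aligned_box(H):
    """Enclose harmful pre-activations H (n x d, assuming n >= d) in a
    hyper-rectangle aligned with their principal directions."""
    mu = H.mean(axis=0)
    _, _, Vt = np.linalg.svd(H - mu, full_matrices=False)
    coords = (H - mu) @ Vt.T              # representations in the SVD basis
    return Vt, mu, coords.min(axis=0), coords.max(axis=0)

def certify_box(w, b, Vt, mu, lo, hi, threshold=0.5):
    """Exact SAT/UNSAT certificate in O(d) time: because the sigmoid is
    monotonic in the logit, the lowest-scoring point in the box is the
    single corner that minimises the logit."""
    w_rot = Vt @ w                         # head weights in the SVD basis
    b_rot = w @ mu + b                     # absorb the recentring into the bias
    corner = np.where(w_rot > 0, lo, hi)   # per-axis choice minimising the logit
    worst = 1 / (1 + np.exp(-(w_rot @ corner + b_rot)))
    if worst >= threshold:
        return "UNSAT", worst              # every point in the region is flagged
    return "SAT", Vt.T @ corner + mu       # concrete counterexample: a safety hole
```

In this reading, SAT is the bad outcome: the certificate exhibits a point inside the harmful region that the classifier scores as safe, which is exactly the verifiable safety hole the abstract reports for every hyper-rectangle configuration.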
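
The probabilistic certificate can be sketched similarly. The version below assumes the clusters are fit in the same pre-activation space and that per-cluster coverage is estimated by Monte Carlo sampling from each component; the paper's exact construction may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_coverage(H, w, b, threshold=0.5, k=5, n_samples=10_000, seed=0):
    """Per-cluster probabilistic certificate: estimate the fraction of
    each GMM component's mass that the classifier flags as harmful."""
    gmm = GaussianMixture(n_components=k, random_state=seed).fit(H)
    samples, labels = gmm.sample(n_samples)        # samples with component labels
    scores = 1 / (1 + np.exp(-(samples @ w + b)))  # sigmoid head scores
    flagged = scores >= threshold
    return [flagged[labels == c].mean() for c in range(k)]
```

A cluster whose estimated coverage drops sharply as the threshold tightens corresponds to the 'coverage collapse' described for BERT: the safety margin around that cluster is sparsely populated, so small boundary shifts flip the verdict for much of its mass.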