Online Safety Monitoring for LLMs

2026-07-02 • Artificial Intelligence

Artificial Intelligence, Computation and Language, Machine Learning

AI summary

The authors explain that even after alignment training, large language models (LLMs) can still produce unsafe or harmful outputs once deployed. To catch this, they study a simple real-time monitor that uses an external verifier model to score outputs and raises an alarm when that score crosses a threshold. The threshold is calibrated with risk control. In experiments on mathematical reasoning and red teaming datasets, this simple design is competitive with more advanced monitors based on sequential hypothesis testing. This suggests that simple monitors can be effective for checking LLM safety in real time.

large language models, alignment training, safety monitoring, risk control, thresholding, verifier model, red teaming, mathematical reasoning, hypothesis testing

Authors

Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick

Abstract

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
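To make the thresholding idea concrete, here is a minimal Python sketch of one way such a monitor could be calibrated and run. It is not the authors' implementation. The abstract does not specify the loss or the calibration procedure, so the sketch assumes a verifier score where higher means more likely unsafe, a loss that counts unsafe outputs that raise no alarm, and a finite-sample correction in the style of conformal risk control. The names `calibrate_threshold` and `first_alarm` are hypothetical.

```python
from typing import Iterable, Optional, Sequence


def calibrate_threshold(scores: Sequence[float],
                        unsafe: Sequence[int],
                        alpha: float) -> float:
    """Pick the largest alarm threshold whose corrected miss risk is <= alpha.

    Assumptions (not from the paper): an alarm fires when score >= threshold,
    a "miss" is an unsafe calibration example whose score is below the
    threshold, and the empirical risk is corrected as (misses + 1) / (n + 1).
    """
    n = len(scores)
    if n == 0 or n != len(unsafe):
        raise ValueError("scores and unsafe must be non-empty and equal length")

    # If no candidate satisfies the bound, fall back to always alarming.
    best = float("-inf")
    for t in sorted(set(scores)):
        misses = sum(1 for s, y in zip(scores, unsafe) if y == 1 and s < t)
        if (misses + 1) / (n + 1) <= alpha:
            best = t   # risk is non-decreasing in t, so keep the largest
        else:
            break      # every larger threshold misses at least as many
    return best


def first_alarm(stream: Iterable[float], threshold: float) -> Optional[int]:
    """Return the index of the first verifier score that triggers an alarm."""
    for step, score in enumerate(stream):
        if score >= threshold:
            return step
    return None


if __name__ == "__main__":
    # Toy calibration data, made up for illustration only.
    cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
    cal_unsafe = [0, 0, 0, 1, 0, 1, 1, 1, 1]
    thr = calibrate_threshold(cal_scores, cal_unsafe, alpha=0.25)
    print("threshold:", thr)
    print("first alarm at step:", first_alarm([0.2, 0.5, 0.75, 0.1], thr))
```

Choosing the largest admissible threshold keeps false alarms as rare as the risk budget allows. When the calibration set is too small for the corrected bound to hold at any threshold, the sketch falls back to alarming on every output.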
