Towards a Science of AI Agent Reliability
2026-02-18 • Artificial Intelligence
Artificial Intelligence · Computers and Society · Machine Learning
AI summary
The authors explain that current tests for AI agents often reduce performance to a single success-or-failure score, which misses important problems such as how consistent or safe the agents really are. They propose 12 new ways to measure agent reliability, focusing on whether agents behave consistently, handle changes well, fail in predictable ways, and stay safe. Testing 14 AI models, they find that even recent capability improvements have not fixed many reliability issues. Their work gives a fuller picture of how AI agents perform and fail, beyond simple success scores.
AI agents, reliability metrics, consistency, robustness, predictability, safety, benchmarking, agent evaluation, error severity, performance profiling
Authors
Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
Abstract
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations and offer tools for reasoning about how agents perform, degrade, and fail.
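To make the consistency dimension concrete, here is a minimal sketch of what a run-to-run consistency measure could look like: the fraction of tasks whose repeated independent runs all agree (all succeed or all fail). The function name `run_consistency` and this particular definition are illustrative assumptions, not the metrics defined in the paper.

```python
def run_consistency(outcomes: dict[str, list[bool]]) -> float:
    """Fraction of tasks whose repeated runs all agree (all succeed or all fail).

    `outcomes` maps a task id to the success/failure outcome of each
    independent run. Hypothetical illustration of a consistency metric,
    not the paper's actual definition.
    """
    consistent = sum(1 for runs in outcomes.values() if len(set(runs)) == 1)
    return consistent / len(outcomes)

# Example: three tasks, three runs each.
outcomes = {
    "task-a": [True, True, True],    # consistent success
    "task-b": [True, False, True],   # flaky: hidden by an averaged success rate
    "task-c": [False, False, False], # consistent failure
}
print(run_consistency(outcomes))  # ~0.67: two of three tasks behave consistently
```

A mean success rate over these runs would report roughly 0.56 and hide that one task is flaky, which is the kind of operational flaw a single aggregate score obscures.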