What Makes a Good LLM Agent for Real-world Penetration Testing?

2026-02-19

Cryptography and Security, Software Engineering
AI summary

The authors studied many AI systems designed to automatically find security problems in computer networks and found they fail in two main ways: some fail because they lack good tools or instructions, while others fail because they cannot plan or keep track of tasks well. They found that the harder type of failure happens because the AI cannot judge how difficult different tasks are, so it wastes effort on unimportant steps. To fix this, the authors created Excalibur, an agent that combines better tools with a system that estimates task difficulty to guide its actions. Excalibur performed much better than previous methods on various tests, showing that understanding task difficulty improves these AI agents beyond what simply making the models bigger can achieve.

LLM-based agents, penetration testing, task difficulty estimation, planning and state management, tooling, retrieval-augmented knowledge, attack chain, Evidence-Guided Attack Tree Search (EGATS), Capture The Flag (CTF) benchmarks, Active Directory environment
Authors
Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, Tianwei Zhang
Abstract
LLM-based agents show promise for automating penetration testing, yet reported performance varies widely across systems and benchmarks. We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. Our analysis reveals two distinct failure modes: Type A failures stem from capability gaps (missing tools, inadequate prompts) that engineering readily addresses, while Type B failures persist regardless of tooling due to planning and state management limitations. We show that Type B failures share a root cause that is largely invariant to the underlying LLM: agents lack real-time task difficulty estimation. As a result, agents misallocate effort, over-commit to low-value branches, and exhaust context before completing attack chains. Based on this insight, we present Excalibur, a penetration testing agent that couples strong tooling with difficulty-aware planning. A Tool and Skill Layer eliminates Type A failures through typed interfaces and retrieval-augmented knowledge. A Task Difficulty Assessment (TDA) mechanism addresses Type B failures by estimating tractability through four measurable dimensions (horizon estimation, evidence confidence, context load, and historical success) and uses these estimates to guide exploration-exploitation decisions within an Evidence-Guided Attack Tree Search (EGATS) framework. Excalibur achieves up to 91% task completion on CTF benchmarks with frontier models (39 to 49% relative improvement over baselines) and compromises 4 of 5 hosts on the GOAD Active Directory environment versus 2 by prior systems. These results show that difficulty-aware planning yields consistent end-to-end gains across models and addresses a limitation that model scaling alone does not eliminate.
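To make the TDA idea concrete, below is a minimal Python sketch of how the four difficulty dimensions named in the abstract (horizon estimation, evidence confidence, context load, historical success) might be folded into a single tractability score that drives an exploration-exploitation choice over attack-tree branches. All names, weights, thresholds, and the scoring formula (`BranchSignals`, `tractability`, `pick_branch`, `explore_threshold`) are illustrative assumptions for exposition; the paper's actual formulation may differ.

```python
from dataclasses import dataclass

# Hypothetical sketch in the spirit of Excalibur's Task Difficulty
# Assessment (TDA). The four fields mirror the dimensions named in the
# abstract; the equal weighting and the threshold are assumptions.

@dataclass
class BranchSignals:
    horizon_estimate: int       # predicted remaining steps on this branch
    evidence_confidence: float  # 0..1, how strongly evidence supports it
    context_load: float         # 0..1, fraction of context window consumed
    historical_success: float   # 0..1, success rate of similar past actions

def tractability(sig: BranchSignals, max_horizon: int = 20) -> float:
    """Combine the four TDA dimensions into a [0, 1] tractability score.

    Shorter horizons, stronger evidence, lighter context load, and better
    historical success all raise the score (equal weights assumed).
    """
    horizon_term = 1.0 - min(sig.horizon_estimate, max_horizon) / max_horizon
    context_term = 1.0 - sig.context_load
    return (horizon_term + sig.evidence_confidence
            + context_term + sig.historical_success) / 4.0

def pick_branch(branches: dict[str, BranchSignals],
                explore_threshold: float = 0.35) -> str:
    """Stand-in for the exploration-exploitation decision inside EGATS:
    exploit the most tractable branch, or, when nothing looks tractable,
    probe the cheapest (shortest-horizon) branch instead of over-committing.
    """
    scored = {name: tractability(s) for name, s in branches.items()}
    best = max(scored, key=scored.get)
    if scored[best] >= explore_threshold:
        return best  # exploit: commit effort to the high-value branch
    # explore: no branch is clearly tractable, so take a cheap probe
    return min(branches, key=lambda n: branches[n].horizon_estimate)

# Example (hypothetical branch names and signal values):
branches = {
    "smb-relay":  BranchSignals(4, 0.8, 0.3, 0.6),
    "kerberoast": BranchSignals(12, 0.4, 0.3, 0.5),
}
print(pick_branch(branches))  # -> "smb-relay"
```

The point of the sketch is the claimed root cause of Type B failures: without an explicit score like this, an agent has no principled way to abandon a low-value branch before it exhausts its context.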