ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
2026-04-13 • Cryptography and Security • Artificial Intelligence
AI summary
The authors identify a security problem in which language models that use tools can be tricked by attackers who hide malicious instructions inside a tool's output. To address this, they created ClawGuard, a system that checks a user-confirmed rule set at every tool call, blocking harmful instructions before they can take effect. The method works without modifying the language models or the surrounding infrastructure, and it defends against several attack types while still allowing agents to complete their tasks. Experiments showed that ClawGuard keeps agents safe without sacrificing utility.
Keywords: Large Language Models, Prompt Injection, Tool-Augmented Agents, Runtime Security, Tool-Call Boundary, Access Constraints, Adversarial Attacks, AgentDojo, SkillInject, MCPSafeBench
Authors
Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun
Abstract
Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable, alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.
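To illustrate the core idea the abstract describes — a user-confirmed rule set enforced deterministically at the tool-call boundary, with an auditable trace — here is a minimal Python sketch. All names (`RuleSet`, `ToolCallGuard`, the constraint fields) are hypothetical illustrations of the concept, not the authors' actual API; the real system derives the constraints automatically from the user's stated objective.

```python
# Hypothetical sketch of deterministic tool-call boundary enforcement.
# Names and structure are illustrative, not ClawGuard's actual implementation.
from dataclasses import dataclass, field


@dataclass
class RuleSet:
    """User-confirmed access constraints derived from the task objective."""
    allowed_tools: set = field(default_factory=set)
    allowed_path_prefixes: set = field(default_factory=set)

    def permits(self, tool_name: str, args: dict) -> bool:
        # Deterministic checks: no reliance on model alignment.
        if tool_name not in self.allowed_tools:
            return False
        path = args.get("path")
        if path is not None and not any(
            path.startswith(p) for p in self.allowed_path_prefixes
        ):
            return False
        return True


class ToolCallGuard:
    """Intercepts every tool call before any real-world effect is produced."""

    def __init__(self, rules: RuleSet, tools: dict):
        self.rules = rules
        self.tools = tools
        self.audit_log = []  # every decision is recorded for auditing

    def invoke(self, tool_name: str, **args):
        allowed = self.rules.permits(tool_name, args)
        self.audit_log.append((tool_name, args, allowed))
        if not allowed:
            # Injected instructions that trigger out-of-scope tool calls
            # are stopped here, before execution.
            raise PermissionError(f"Blocked tool call: {tool_name}({args})")
        return self.tools[tool_name](**args)
```

A usage example: even if injected content in a tool result persuades the agent to call `send_email`, the guard rejects the call because it is outside the confirmed rule set, while in-scope calls proceed normally.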