Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
2026-03-04 • Machine Learning
Machine Learning · Artificial Intelligence · Computation and Language
AI summary
The authors studied web agents that use both images and text to understand webpages and found that attackers can trick these agents more easily when they manipulate both the visual and text parts together. They created a new training method called DMAST, which teaches the agents to better handle confusing or misleading input by simulating attacks, learning from a strong teacher model, and improving through self-play. This training made the agents safer against attacks and more efficient at completing tasks, especially on new types of webpages they hadn't seen before. Their method works better than previous defense techniques.
multimodal web agents, accessibility trees, adversarial attacks, imitation learning, reinforcement learning, Markov game, self-play, zero-acknowledgment strategy, task completion efficiency, out-of-distribution generalization
Authors
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng
Abstract
Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWoB++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
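For readers unfamiliar with GRPO, the key idea used in stage (3) is to replace a learned value critic with a group-relative baseline: for each prompt, a group of rollouts is sampled and each rollout's reward is normalized against the group's own mean and standard deviation. The sketch below is a minimal generic illustration of that advantage computation, not the authors' implementation; the reward values and group size are hypothetical.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only,
# not the paper's code). Each rollout's reward is normalized against the
# statistics of its own group, so no separate value network is needed.
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward within its sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four self-play rollouts of the agent against the
# attacker, rewarded 1.0 for completing the task under attack, 0.0 otherwise.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages, and the group advantages sum to zero; in self-play, the attacker's reward is the negation of the agent's, matching the zero-sum Markov game formulation.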