Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

2026-03-04 · Machine Learning

Machine Learning · Artificial Intelligence · Cryptography and Security · Multiagent Systems
AI summary

The authors discuss challenges in training large language models that act as autonomous agents, where training can become unstable because highly non-linear policies create extreme local curvature. They point out that the usual remedy, bounding the policy's sensitivity in all directions at once, is overly strict and degrades nominal performance. To address this, they propose Adversarially-Aligned Jacobian Regularization (AAJR), which limits sensitivity only along the directions that matter for adversarial attacks. This admits a more flexible class of policies while keeping training stable with little loss of performance. The authors also provide mathematical proofs that their approach is better suited for robust multi-agent training than global constraints.

Keywords
Large Language Models, minimax training, policy sensitivity, Jacobian regularization, adversarial training, optimization stability, inner maximization, multi-agent systems, robustness, step-size conditions
Authors
Furkan Mumcu, Yasin Yilmaz
Abstract
As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instability when highly non-linear policies induce extreme local curvature in the inner maximization. Standard remedies that enforce global Jacobian bounds are overly conservative, suppressing sensitivity in all directions and inducing a large Price of Robustness. We introduce Adversarially-Aligned Jacobian Regularization (AAJR), a trajectory-aligned approach that controls sensitivity strictly along adversarial ascent directions. We prove that AAJR yields a strictly larger admissible policy class than global constraints under mild conditions, implying a weakly smaller approximation gap and reduced nominal performance degradation. Furthermore, we derive step-size conditions under which AAJR controls effective smoothness along optimization trajectories and ensures inner-loop stability. These results provide a structural theory for agentic robustness that decouples minimax stability from global expressivity restrictions.
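The core contrast in the abstract, a directional penalty along the adversarial ascent direction versus a global Jacobian bound, can be illustrated numerically. The sketch below is an illustrative assumption, not the authors' implementation: it uses a toy differentiable policy, estimates the adversarial ascent direction v by finite differences, and compares the directional penalty ||J v||² with a full Frobenius-type Jacobian penalty. Since v is a unit vector, the directional penalty is always at most the global one, which is the sense in which AAJR is less conservative.

```python
import numpy as np

# Toy sketch of Adversarially-Aligned Jacobian Regularization (AAJR).
# The policy, loss, and all function names here are illustrative
# assumptions for exposition, not the paper's actual implementation.

def policy(theta, x):
    # A toy differentiable policy: linear map followed by tanh.
    return np.tanh(theta @ x)

def loss(theta, x):
    # Toy inner-maximization objective: squared norm of the policy output.
    return 0.5 * np.sum(policy(theta, x) ** 2)

def adversarial_direction(theta, x, eps=1e-5):
    # Finite-difference gradient of the loss w.r.t. the input x,
    # normalized to give the adversarial ascent direction v.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (loss(theta, x + e) - loss(theta, x - e)) / (2 * eps)
    n = np.linalg.norm(g)
    return g / n if n > 0 else g

def aajr_penalty(theta, x, eps=1e-5):
    # Directional penalty ||J(x) v||^2, where v is the adversarial ascent
    # direction: only sensitivity along v is suppressed.
    v = adversarial_direction(theta, x)
    jv = (policy(theta, x + eps * v) - policy(theta, x - eps * v)) / (2 * eps)
    return np.sum(jv ** 2)

def global_penalty(theta, x, eps=1e-5):
    # Global (squared Frobenius) Jacobian penalty for comparison: sums the
    # directional sensitivity over an entire orthonormal basis.
    total = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        jv = (policy(theta, x + eps * e) - policy(theta, x - eps * e)) / (2 * eps)
        total += np.sum(jv ** 2)
    return total

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
x = rng.normal(size=4)

p_dir = aajr_penalty(theta, x)
p_glob = global_penalty(theta, x)
# For a unit direction v, ||J v||^2 <= ||J||_F^2, so the directional
# penalty never exceeds the global one.
print(p_dir, p_glob)
```

In practice one would add `aajr_penalty` (computed with automatic differentiation rather than finite differences) to the training loss, so that expressivity is restricted only along the inner maximizer's ascent trajectory rather than globally.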