MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?
2026-06-22 • Machine Learning
Machine Learning, Multiagent Systems
AI summary
The authors studied systems made up of multiple AI agents, each guided by instructions called system prompts. These prompts help define the agents' roles and how they work together, without needing to change the AI models themselves. They explored how optimizing these prompts affects overall system performance, especially since coordinating many agents is more complex than just one. Their experiments showed that improving prompts can lead to better results, but the benefits depend on factors like tasks and team setup. They also identified new challenges in optimizing prompts for multiple cooperating agents.
multi-agent systems, large language models, system prompts, prompt optimization, workflow, inter-agent coordination, communication protocol, output aggregation, model finetuning
Authors
Juyang Bai, Laixi Shi
Abstract
Multi-agent systems (MAS) offer a scalable path forward for agentic AI. A MAS comprises multiple LLM-based agents, each assigned a system prompt and a position within a workflow that governs inter-agent coordination and output aggregation. System prompts thus form a critical and accessible optimization surface: they specify agents' roles and behaviors, enabling system-level improvements without model finetuning. Although prompt optimization has shown substantial potential for single LLMs, extending it to MAS poses distinct challenges, notably an exponentially growing search space. It remains unclear whether, when, and by how much prompt optimization improves MAS performance, and how sensitive such gains are to system configuration. In this work, we systematically study system-prompt optimization across a broad range of MAS setups varying in task, workflow, communication protocol, and team size, benchmarking two prompt optimizers that naturally extend state-of-the-art single-agent methods. The results show that prompt optimization can unlock significant gains while exposing open challenges, and they characterize when and how much it helps across diverse MAS settings.
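To make the "exponentially growing search space" concrete, the following is a minimal sketch under a simplifying assumption that is not taken from the paper: each agent independently picks its system prompt from a fixed pool of candidates. The function names `joint_prompt_space_size` and `enumerate_joint_assignments` are hypothetical and exist only for illustration. Under that assumption, a team of n agents with k candidates each has k^n joint prompt assignments.

```python
from itertools import product


def joint_prompt_space_size(candidates_per_agent: int, num_agents: int) -> int:
    """Count joint prompt assignments, assuming every agent chooses
    independently from the same number of candidate prompts (illustrative
    assumption, not the paper's formulation)."""
    return candidates_per_agent ** num_agents


def enumerate_joint_assignments(candidates, num_agents):
    """List every joint assignment explicitly; only feasible for tiny teams."""
    return list(product(candidates, repeat=num_agents))


if __name__ == "__main__":
    # With 10 candidate prompts per agent, each added agent multiplies
    # the joint space by 10.
    for n in (1, 2, 4, 8):
        print(n, joint_prompt_space_size(10, n))

    # A two-agent team with three candidate prompts has 3 * 3 = 9 assignments.
    print(len(enumerate_joint_assignments(["p1", "p2", "p3"], 2)))
```

Exhaustive enumeration is shown only to illustrate the growth. With a search space this large, a practical optimizer would have to search it selectively.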