SAGE: A Service Agent Graph-guided Evaluation Benchmark

2026-04-10 · Artificial Intelligence

AI summary

The authors created SAGE, a benchmark that tests how well large language models (LLMs) handle customer service tasks, using dynamic dialogue graphs derived from standard operating procedures (SOPs). It checks not only whether a model understands the user's request but also whether it follows the steps the procedure requires. Across 27 models and 6 industrial scenarios, models often recognized user intents correctly yet struggled to take the right next action, revealing a gap between understanding and execution. Models also tended to stay polite even while making logical mistakes under strongly adversarial conditions.

Large Language Models, customer service automation, benchmarking, Standard Operating Procedures, dynamic dialogue graphs, intent classification, adversarial intent, evaluation framework, logical compliance
Authors
Ling Shi, Yuqin Dai, Ziyin Wang, Ning Gao, Wei Zhang, Chaozheng Wang, Yujie Wang, Wei He, Jinpeng Wang, Deiyi Xiong
Abstract
The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant "Execution Gap" where models accurately classify intents but fail to derive correct subsequent actions. We also observe "Empathy Resilience", a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.
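
To make the idea of graph-guided compliance checking concrete, the following is a minimal sketch, not the SAGE implementation. It assumes an SOP can be written as a directed graph whose nodes are service actions and whose edges are the permitted next steps. The refund SOP, the node names, and the functions `first_violation` and `edge_coverage` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical refund SOP (an assumption for illustration, not taken from SAGE):
# each action maps to the set of actions that may legally follow it.
SOP_GRAPH: Dict[str, Set[str]] = {
    "greet": {"verify_identity"},
    "verify_identity": {"check_order", "escalate"},
    "check_order": {"issue_refund", "deny_refund"},
    "issue_refund": {"close"},
    "deny_refund": {"escalate", "close"},
    "escalate": {"close"},
    "close": set(),
}


def first_violation(graph: Dict[str, Set[str]], path: List[str],
                    start: str = "greet") -> int:
    """Return the index of the first step that breaks the SOP, or -1 if none does.

    An empty path, or one that does not begin at `start`, is reported as a
    violation at index 0.
    """
    if not path or path[0] != start:
        return 0
    for i in range(1, len(path)):
        # A step is compliant only if the graph has an edge from the previous step.
        if path[i] not in graph.get(path[i - 1], set()):
            return i
    return -1


def edge_coverage(graph: Dict[str, Set[str]], paths: List[List[str]]) -> float:
    """Fraction of SOP edges traversed by at least one of the given paths."""
    all_edges: Set[Tuple[str, str]] = {
        (src, dst) for src, dsts in graph.items() for dst in dsts
    }
    seen: Set[Tuple[str, str]] = set()
    for path in paths:
        for edge in zip(path, path[1:]):
            if edge in all_edges:  # ignore transitions the SOP does not define
                seen.add(edge)
    return len(seen) / len(all_edges) if all_edges else 1.0


compliant = ["greet", "verify_identity", "check_order", "issue_refund", "close"]
skips_verification = ["greet", "check_order", "issue_refund", "close"]

print(first_violation(SOP_GRAPH, compliant))           # compliant dialogue
print(first_violation(SOP_GRAPH, skips_verification))  # breaks at the second step
print(edge_coverage(SOP_GRAPH, [compliant]))           # share of SOP edges exercised
```

Because the check is a plain lookup over graph edges, the verdict for a given action sequence is deterministic, which is the property a rule-based checker needs. How SAGE maps free-form dialogue turns onto graph nodes is not described in the abstract and is outside this sketch.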