Beyond Single-Policy: Evaluating Composed Organization-Specific Policy Alignment in LLM Chatbots
2026-06-03 • Software Engineering
AI summary
The authors studied how chatbots used in places like healthcare and finance follow organizational rules when answering questions. By analyzing real-world user queries, they found that chatbots often break rules when a single question involves several rules at once, a problem that existing benchmarks overlook. To test for this, the authors created a tool called COPAL that automatically generates tricky questions to see whether chatbots break any rules when rules combine. Across nine served models, these composed-policy questions produced an average error rate of 33.1%, suggesting this issue needs more attention.
large language models, chatbots, policy alignment, composed-policy violation, query generation, interaction patterns, rule compliance, model evaluation, automated testing
Authors
Yingjie Liu, Yongxiang Hu, Xuan Wang, Yilun Li, Yunlei Wei, Xiaoyu Wang, Yangfan Zhou
Abstract
Large language model chatbots are increasingly deployed in organizational settings such as healthcare, finance, and public services. Evaluating policy alignment is therefore critical to reliable chatbot deployment. By analyzing real-world user queries, we find that composed-policy violations are prevalent across chatbots yet overlooked by existing benchmarks. This paper presents COPAL, an automated tool for evaluating composed-policy alignment in chatbots. COPAL efficiently generates queries that trigger composed-policy failures using empirically derived interaction patterns and explicit handling contracts. Queries generated by COPAL expose substantial query-handling failures: across 9 served models, composed-policy queries yield a 33.1% error rate on average, indicating that composed-policy alignment warrants further investigation.