DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

2026-06-02, Software Engineering

Software Engineering, Artificial Intelligence
AI summary

The authors study overrefusal, where large language models (LLMs) refuse to answer safe questions because the questions look risky. They present DDOR, a method that automatically finds the smallest parts of a prompt that trigger a refusal, giving phrase-level evidence for why the refusal occurred. DDOR then generates new test prompts around these fragments and filters out intrinsically unsafe or ambiguous ones, producing test suites that check whether a model's refusals are warranted. The same fragments guide targeted prompt repair, which reduces unnecessary refusals while keeping the model safe on genuinely harmful inputs. The approach is black-box: it needs only model inputs and outputs, not access to the model's internals.

large language models, overrefusal, safety alignment, delta debugging, prompt engineering, black-box testing, model evaluation, prompt repair, multi-oracle validation
Authors
Qinyan Zhou, Peixin Zhang, Jun Sun, Haonan Zhang, Dongxia Wang
Abstract
While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide phrase-level, explainable evidence for why a refusal occurs. Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter intrinsically unsafe or ambiguous cases, producing scalable and model-specific overrefusal test suites (approximately 1K cases per model). Beyond evaluation, we further leverage localized mRTFs to perform targeted prompt repair, substantially reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution to both evaluate and mitigate overrefusal, improving LLM usability without sacrificing safety.
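
To make the localization step concrete, the following is a minimal sketch of classic delta debugging (the ddmin algorithm) applied to a prompt. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify DDOR's exact reduction procedure, and the names `ddmin`, `split_chunks`, and `toy_refuses` are hypothetical. The prompt is assumed to be split on whitespace, and `refuses` stands in for a black-box oracle that would, in practice, query the model and classify its response as a refusal.

```python
from typing import Callable, List, Sequence


def split_chunks(items: List[str], n: int) -> List[List[str]]:
    """Split items into n contiguous, near-equal chunks (requires n <= len(items))."""
    k, r = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def ddmin(tokens: Sequence[str], refuses: Callable[[List[str]], bool]) -> List[str]:
    """Reduce tokens to a 1-minimal subsequence that still triggers refuses().

    `refuses` is a hypothetical black-box oracle: True if the model refuses
    a prompt built from the given tokens. It is assumed to be deterministic.
    """
    current = list(tokens)
    if not refuses(current):
        raise ValueError("the full prompt must trigger a refusal")
    n = 2  # current granularity: number of chunks to split into
    while len(current) >= 2:
        chunks = split_chunks(current, n)
        reduced = False
        # Phase 1: does any single chunk trigger the refusal on its own?
        for chunk in chunks:
            if refuses(chunk):
                current, n, reduced = chunk, 2, True
                break
        # Phase 2: does removing one chunk still leave a refused prompt?
        if not reduced:
            for i in range(len(chunks)):
                complement = [t for j, c in enumerate(chunks) if j != i for t in c]
                if refuses(complement):
                    current = complement
                    n = min(max(n - 1, 2), len(current))
                    reduced = True
                    break
        # Phase 3: no progress, so refine the granularity or stop.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


def toy_refuses(tokens: List[str]) -> bool:
    """Toy stand-in for a model that refuses any prompt containing 'kill'."""
    return "kill" in tokens


prompt = "How do I kill a stuck Python process".split()
fragment = ddmin(prompt, toy_refuses)  # -> ['kill']
```

In this toy setting the benign prompt about terminating a process is reduced to the single word that triggers the refusal, which plays the role of an mRTF. A real oracle would cost one model query per call, so caching results for repeated token subsets would likely be worthwhile.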