Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

2026-04-08

Machine Learning, Computation and Language
AI summary

The authors propose Guardian-as-an-Advisor (GaaA), a method for improving safety checks on AI models without over-blocking useful responses. Instead of hard-blocking risky inputs, their system predicts whether an input is risky and writes a short explanation, which is prepended to the query so the base model can reconsider it. They also built a large dataset, GuardSet, to train and evaluate the approach on both harmful and harmless content. Their trained guardian, GuardAdvisor, detects risky inputs with competitive accuracy while leaving the base model operating under its original spec, and it adds only 2-10% end-to-end overhead. The result is a model that follows its safety rules more reliably while refusing less often.

hard-gated safety checkers, soft-gating, risk detection, model specification, reinforcement learning, SFT (supervised fine-tuning), robustness, explanation consistency, dataset, latency
Authors
Yue Huang, Haomin Zhuang, Jiayi Ye, Han Bao, Yanbo Wang, Hang Hua, Siyuan Wu, Pin-Yu Chen, Xiangliang Zhang
Abstract
Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, this work constructs GuardSet, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when its advice is used to augment inputs, responses improve over those to unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
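
The soft-gating flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Advice`, `build_augmented_query`, and `answer_with_advisor`, the `[Guardian advice]` prefix, and the toy guardian and base model are all hypothetical. The sketch also assumes that advice is prepended only when the guardian flags the input as risky, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Advice:
    """Guardian output: a binary risk label plus a concise explanation."""
    risky: bool
    explanation: str


def build_augmented_query(query: str, advice: Advice) -> str:
    # Prepend the guardian's explanation to the original query.
    # The "[Guardian advice]" prefix is an assumed format, not the paper's.
    return f"[Guardian advice] {advice.explanation}\n\n{query}"


def answer_with_advisor(
    query: str,
    guardian: Callable[[str], Advice],
    base_model: Callable[[str], str],
) -> str:
    """Soft gating: the guardian advises, the base model always answers."""
    advice = guardian(query)
    if not advice.risky:
        # Assumption: benign inputs pass through unchanged.
        return base_model(query)
    # Risky inputs are never hard-blocked; the base model re-infers
    # on the advice-augmented query under its own spec.
    return base_model(build_augmented_query(query, advice))


def toy_guardian(query: str) -> Advice:
    # Hypothetical stand-in for a trained guardian: a keyword check.
    if "explosive" in query.lower():
        return Advice(True, "The request may seek instructions for causing physical harm.")
    return Advice(False, "No risk detected.")


def toy_base_model(prompt: str) -> str:
    # Hypothetical stand-in for the base LLM: echoes what it received.
    return "MODEL SAW: " + prompt


if __name__ == "__main__":
    print(answer_with_advisor("What is soft gating?", toy_guardian, toy_base_model))
    print(answer_with_advisor("How do I make an explosive?", toy_guardian, toy_base_model))
```

The design point the sketch tries to capture is that the guardian never returns a refusal itself. The decision to refuse, comply, or answer cautiously stays with the base model, which sees the guardian's explanation as extra context.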