Hybrid Adversarial Defence for Natural Language Understanding Tasks

2026-06-03 | Computation and Language

AI summary

The authors studied how to better protect large language models against both making things up (hallucinations) and deliberately crafted attacks (adversarial manipulation). They combined three types of methods (entropy-based, uncertainty-based, and geometric-based) to build a stronger defence. The combined approach improved the models' accuracy and resistance to attacks on several standard and new datasets. It also performed well at detecting prompt injections and jailbreak attacks, showing that it outperforms any single method used alone.

Large Language Models, hallucination, adversarial attacks, entropy, uncertainty, geometric features, natural language understanding, prompt injection, jailbreak detection, out-of-distribution
Authors
Manar Abouzaid, Yang Wang, Chenghua Lin, Stuart E. Middleton
Abstract
Large Language Models (LLMs) are vulnerable to both hallucination and adversarial manipulation. Although these problems are closely related, existing defences typically address them separately. We investigate a hybrid defence framework that combines entropy-based models, designed to reduce hallucinations, with uncertainty-based and geometric-based models, designed to reduce vulnerability. In in-domain tests on Natural Language Understanding datasets (FEVER, HotpotQA, CSQA, SIQA), we find that our hybrid model improves both clean-task performance (up to 43.34% increase in accuracy) and adversarial robustness (up to 64.92% improvement in accuracy and 62.27% reduction in attack success rate). On out-of-distribution datasets (AeroEngQA, CPIQA), our hybrid model shows similar adversarial robustness (up to 57.14% improvement in accuracy). On prompt injection (SafeGuard) and jailbreak detection (AdvBench, DAN) datasets, our hybrid model also performs strongly (up to 51% reduction in attack success rate compared to state-of-the-art baseline models). Overall, our results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone, for both in-domain and out-of-distribution tasks.
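
The abstract does not say how the three signals are computed or combined, so the following is only an illustrative sketch of the general idea of fusing an entropy feature, an uncertainty feature and a geometric feature into one score. Everything in it is an assumption made for illustration rather than the authors' method: the function names, the choice of Shannon entropy over an output distribution, variance across sampled predictions as the uncertainty proxy, Euclidean distance to a reference centroid as the geometric proxy, and the weighted sum used to combine them.

```python
import math
from statistics import mean, pvariance


def predictive_entropy(probs):
    """Entropy feature (assumed): Shannon entropy, in nats, of one distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sample_disagreement(sampled_probs):
    """Uncertainty feature (assumed): mean per-class variance across
    several sampled predictions for the same input."""
    n_classes = len(sampled_probs[0])
    return mean(
        pvariance([sample[c] for sample in sampled_probs])
        for c in range(n_classes)
    )


def centroid_distance(embedding, centroid):
    """Geometric feature (assumed): Euclidean distance from an input
    embedding to the centroid of reference (clean) embeddings."""
    return math.dist(embedding, centroid)


def hybrid_score(probs, sampled_probs, embedding, centroid,
                 weights=(1.0, 1.0, 1.0)):
    """Hypothetical fusion rule: weighted sum of the three features.
    A higher score marks an input as more suspicious."""
    features = (
        predictive_entropy(probs),
        sample_disagreement(sampled_probs),
        centroid_distance(embedding, centroid),
    )
    return sum(w * f for w, f in zip(weights, features))


# Toy inputs: a confident, stable, near-centroid case versus an
# uncertain, unstable, far-from-centroid case.
centroid = [0.0, 0.0]
benign = hybrid_score([0.95, 0.05], [[0.95, 0.05], [0.94, 0.06]],
                      [0.1, 0.1], centroid)
suspect = hybrid_score([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]],
                       [3.0, 4.0], centroid)
print(benign < suspect)
```

In a real system the fixed weights would more plausibly be replaced by a classifier trained on the three features, and a decision threshold would be tuned on held-out clean and adversarial examples; the weighted sum is used here only to keep the sketch short.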