Can LLMs Reliably Self-Report Adversarial Prefills, and How?

2026-06-22 • Computation and Language

AI summary

The authors studied whether large language models (LLMs) can recognize when one of their own previous responses was elicited by an adversarial prefill attack rather than produced naturally. Across ten open-weight instruction-tuned models (3B to 70B) and four safety benchmarks, no model reliably detected its compromised outputs: on average, models still claimed prefilled responses as intended 27.3% of the time. The introspective signal was largely tied to safety- and refusal-related reasoning, and it depended on how the question was framed (internal intention versus external tampering). LoRA finetuning with SFT, GRPO, or DPO widened the gap on the intention probe for every model from 8B to 27B, but the improvement did not transfer to the tampering probe and raised attack success rate under adversarial prefill on most models. Overall, the research indicates that LLM self-reports about being manipulated are limited and unreliable.

large language models, introspection, adversarial prefill attack, instruction-tuning, safety benchmarks, refusal mechanism, LoRA finetuning, self-report reliability, adversarial robustness, intention probing

Authors

Quang Minh Nguyen, Uzair Ahmed, Taegyoon Kim

Abstract

Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of $27.3\%$. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models' weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. We test three LoRA finetuning methods (SFT, GRPO, DPO) on eight models from 3B to 27B; all three widen the intention-probe gap on every model from 8B to 27B, with method ranking varying by model. The intervention does not transfer to the tampering probe and counterintuitively raises attack success rate under adversarial prefill on most models, amounting to a partial mitigation. These findings outline mechanisms underpinning the observed introspective signals in safety contexts and highlight risks in the reliability of LLM self-reports.
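
As a rough illustration of two quantities the abstract refers to, the sketch below shows one common way to orthogonalize a weight matrix against a direction, $W' = W - \hat{r}\hat{r}^{\top}W$, and one way to compute a gap between claiming rates on natural and prefilled outputs. The abstract does not specify the exact procedure, so the function names, the matrix layout, and the sign convention of the gap (natural minus prefilled) are all assumptions made for this example, not the authors' implementation.

```python
import math


def orthogonalize(weight, direction):
    """Return W' = W - r r^T W, with r = direction / ||direction||.

    Assumption: `weight` is a d_out x d_in matrix (list of rows) whose
    outputs live in the same space as `direction` (length d_out).
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    r = [x / norm for x in direction]
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how much each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    # Subtract that component so the result writes nothing along r.
    return [[weight[i][j] - r[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]


def claiming_gap(natural_claims, prefilled_claims):
    """Claiming rate on natural outputs minus rate on prefilled outputs.

    Each argument is a list of booleans: True if the model claimed the
    response as intended. The sign convention is an assumption.
    """
    natural_rate = sum(natural_claims) / len(natural_claims)
    prefilled_rate = sum(prefilled_claims) / len(prefilled_claims)
    return natural_rate - prefilled_rate


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    print(orthogonalize(w, [0.0, 2.0]))
    print(claiming_gap([True, True, False, True], [True, False, False, False]))
```

Under this convention, a gap near zero means the model claims prefilled and natural responses at about the same rate, which is the collapse the abstract describes after orthogonalization.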
