Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

2026-06-03
Computation and Language
AI summary

The authors argue that existing jailbreaks against aligned large language models (LLMs) are easy to detect and patch because each one relies on a specific, recognizable prompt. They instead target an entire register of real human writing, namely fanfiction subgenres from Archive of Our Own, and embed the harmful request as the climax of a creative scene. The technique needs no attacker model and no per-target tuning, and across eight models it raises the mean attack success rate from 0.278 to 0.731. The two active defenses they tested widen the gap between this attack and the baseline, which suggests that template-targeting defenses push attackers toward register-based attacks. Finally, they propose SAGA-A4, a static four-turn extension that reaches a mean attack success rate of 0.924 and outperforms three existing multi-turn methods.

jailbreak, large language models, alignment, Archive of Our Own, fanfiction, attack success rate, safety training, meta-prompting, multi-turn attack
Authors
Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang
Abstract
Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta-prompt is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean attack success rate (ASR) from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.
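
The headline figures in the abstract can be read as simple arithmetic on mean attack success rates. The sketch below is illustrative only: it assumes ASR is the fraction of attempts judged successful (the abstract does not say how the four judges' verdicts are combined), and the function names `attack_success_rate` and `asr_ratio` are hypothetical, not taken from the paper. It computes the absolute gain and the undefended vernacular-to-baseline ratio from the two reported means.

```python
def attack_success_rate(verdicts):
    """Fraction of attempts judged successful.

    `verdicts` is a list of booleans, one per attempt. How a judge
    ensemble reaches each verdict is left open here (assumption).
    """
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return sum(verdicts) / len(verdicts)


def asr_ratio(attack_asr, baseline_asr):
    """Ratio of an attack's ASR to a baseline ASR."""
    if baseline_asr <= 0:
        raise ValueError("baseline_asr must be positive")
    return attack_asr / baseline_asr


# Mean ASR values as reported in the abstract (no defence applied).
BASELINE_MEAN_ASR = 0.278
VERNACULAR_MEAN_ASR = 0.731

absolute_gain = VERNACULAR_MEAN_ASR - BASELINE_MEAN_ASR
ratio = asr_ratio(VERNACULAR_MEAN_ASR, BASELINE_MEAN_ASR)

print(f"absolute gain: {absolute_gain:.3f}")        # absolute gain: 0.453
print(f"vernacular-to-baseline ratio: {ratio:.2f}")  # vernacular-to-baseline ratio: 2.63
```

On these numbers the attack succeeds roughly 2.6 times as often as the baseline. The abstract's claim about defences is that this ratio grows, rather than shrinks, once a defence is applied; the defended values are not given in the abstract, so they are not reproduced here.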