A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

2026-06-16

Cryptography and Security, Artificial Intelligence, Computation and Language
AI summary

The authors tested how well two advanced language models from Anthropic, Fable 5 and Opus 4.8, resist attempts to trick them into producing harmful output. They used an automated red-teaming framework called HackAgent to run four families of jailbreak attacks, and a panel of three judge models re-checked every apparent success by majority vote. Both models stopped most attacks, and static obfuscation was almost entirely neutralised, but adaptive, iterative attacks still elicited harmful outputs. The strongest adaptive attack (tree-of-attacks) broke Opus 4.8 on about 11.5% of harmful intents, while Fable 5 was broken on at most about 6.1%. The authors conclude that even state-of-the-art models can still be reliably broken under sustained, automated attack.

large language models, adversarial attacks, jailbreak attacks, harm taxonomy, red-teaming, automated testing, adaptive attacks, model robustness, Anthropic, HackAgent
Authors
Nicola Franco
Abstract
We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1 620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.
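To make the two kinds of figures in the abstract concrete, the following is a minimal sketch of how a three-judge majority vote and an intent-level success rate might be computed. It is not the HackAgent implementation: the `Attempt` record, the field names, the attack-family labels, and the rule that an intent counts as broken if any one of its attempts is panel-confirmed are all assumptions made for illustration, and the data is a toy example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Attempt:
    """One adversarial attempt (hypothetical record layout)."""
    intent_id: str                        # identifier of the harmful intent
    attack_family: str                    # illustrative label, e.g. "tree_of_attacks"
    judge_votes: Tuple[bool, bool, bool]  # True = this judge flags the completion as harmful


def panel_confirms(votes: Tuple[bool, ...]) -> bool:
    """Majority vote: confirmed when more than half of the judges flag it."""
    return sum(votes) * 2 > len(votes)


def confirmed_completions(attempts: Iterable[Attempt]) -> int:
    """Count panel-confirmed completions across all attempts and families."""
    return sum(1 for a in attempts if panel_confirms(a.judge_votes))


def intent_level_rate(attempts: Iterable[Attempt], family: str, total_intents: int) -> float:
    """Fraction of intents with at least one confirmed success for one family.

    Assumption: an intent is 'broken' if any attempt against it is confirmed.
    """
    broken = {
        a.intent_id
        for a in attempts
        if a.attack_family == family and panel_confirms(a.judge_votes)
    }
    return len(broken) / total_intents


# Toy data: 4 intents in total, only some of them attacked successfully.
attempts = [
    Attempt("i1", "tree_of_attacks", (True, True, False)),    # confirmed (2 of 3)
    Attempt("i1", "tree_of_attacks", (True, True, True)),     # confirmed, same intent
    Attempt("i2", "tree_of_attacks", (True, False, False)),   # rejected (1 of 3)
    Attempt("i3", "static_obfuscation", (True, True, True)),  # confirmed, other family
]

rate = intent_level_rate(attempts, "tree_of_attacks", total_intents=4)
completions = confirmed_completions(attempts)
print(rate, completions)  # 0.25 3
```

In this toy run the intent-level rate for one family (0.25) and the count of confirmed completions across all families (3) differ, because several confirmed completions can belong to the same intent. Under the assumptions above, this is why a per-intent percentage and a total count of harmful completions are reported as separate quantities.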