When Autoregressive Consistency Hurts Safety Alignment

2026-06-02Machine Learning

Machine LearningCryptography and Security
AI summary

The authors explain that the way large language models (LLMs) are made safe is often weak because the safety training mostly affects just the very beginning of their responses. They show this happens because LLMs tend to predict the next word by sticking closely to the recent text they've generated, a pattern called autoregressive consistency. The authors demonstrate that this can be exploited by attacks that insert harmful words later in the output, which then steer the model toward bad behavior despite safe starts. To fix this, they suggest training methods that focus on preventing bad continuations anywhere in the response, not just at the start. Their work highlights the importance of considering autoregressive consistency for both protecting and attacking LLMs.

large language modelssafety alignmentfine-tuningautoregressive consistencynext-token predictionlearning dynamicsadversarial attacksrandom insertion attackharmful continuation statesworst-case training
Authors
Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu
Abstract
Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens. We argue that this phenomenon can be understood through autoregressive consistency, the tendency of next-token prediction to preserve and extend the current response trajectory consistently. By analyzing the learning dynamics of safety alignment, we show that autoregressive consistency can concentrate alignment updates on early tokens, offering a mechanistic explanation for shallow safety alignment. The same mechanism also predicts a broader class of attacks on LLMs: attacks that induce harmful continuation states at arbitrary positions in the output trajectory. As a concrete example, we introduce random insertion attack, which inserts a short harmful span into an otherwise safe refusal trajectory and exploits autoregressive consistency to sustain the resulting harmful branch, thereby bypassing safety alignment. Notably, a short harmful span can redirect the generation to be harmful even after a long refusal prefix, highlighting autoregressive consistency as a potential broader failure mechanism. This suggests that safety alignment should also break harmful autoregressive consistency throughout the output trajectory. We therefore propose adversarial safety alignment, an initial framework based on worst-case harmful continuation states, and instantiate it with random worst-insertion training. Overall, our results suggest that autoregressive consistency should be treated as a central consideration in both safety alignment and attack design.