AI summary
The authors found that when language models are fine-tuned on harmless tasks, their built-in safety features can unexpectedly get worse, even without harmful data or bad intentions. They explain that the common belief—that fine-tuning changes keep clear of important safety directions—is misleading because the training process naturally moves the model into risky zones due to the curved shape of the problem. Their work shows this happens because safety depends on a fragile, curved part of the model’s parameters, which normal training methods can’t easily avoid or fix. They suggest that understanding this geometry can help develop better techniques to keep models safe during fine-tuning, moving from fixing problems after they happen to preventing them beforehand.
Keywords: fine-tuning, aligned language models, safety guardrails, gradient descent, parameter space, curvature, second-order methods, alignment instability, loss landscape, geometric analysis
Authors
Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova
Abstract
Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates remain orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods can neither detect nor defend against. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm: the dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.
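Illustrative sketch (not from the paper): the quartic scaling law described above can be reproduced in a minimal two-parameter toy model, assuming a quadratic alignment loss with sharpness lam along a safety-critical direction and a single cross-curvature coupling c between that direction and the benign fine-tuning task. The specific losses, constants, and gradient-descent setup below are assumptions for illustration only, not the paper's construction.

```python
import numpy as np

# Toy 2-parameter sketch (illustrative assumptions, not the paper's construction):
#   theta[0] : benign fine-tuning task direction
#   theta[1] : safety-critical direction
# The fine-tuning loss has an off-diagonal (cross-curvature) term c*theta[0]*theta[1],
# so the initial gradient is orthogonal to the safety direction, but the coupling
# steers later updates into it.
a = 1.0       # fine-tuning target along theta[0] (assumed)
c = 0.5       # curvature coupling between task and safety directions (assumed)
lam = 10.0    # sharpness of the alignment loss along theta[1] (assumed)
eta = 5e-4    # learning rate
steps = 2000

def ft_grad(theta):
    # gradient of L_ft(theta) = 0.5*(theta[0] - a)**2 + c*theta[0]*theta[1]
    return np.array([theta[0] - a + c * theta[1], c * theta[0]])

def align_loss(theta):
    # L_align(theta) = 0.5 * lam * theta[1]**2: only the safety direction matters
    return 0.5 * lam * theta[1] ** 2

theta = np.zeros(2)                  # start perfectly aligned: L_align = 0
losses = []
for _ in range(steps):
    theta -= eta * ft_grad(theta)    # plain gradient descent on the benign task only
    losses.append(align_loss(theta))

# Early-time prediction: theta[1] ~ -(eta**2 * c * a / 2) * t**2, so L_align ~ t**4.
ts = np.arange(1, steps + 1)
window = slice(100, 800)             # fit before theta[0] saturates near a
slope = np.polyfit(np.log(ts[window]), np.log(np.array(losses)[window]), 1)[0]
print(f"fitted growth exponent of alignment loss: t^{slope:.2f} (expect ~4)")
```

In this toy setting the first update is exactly orthogonal to the safety direction, yet the off-diagonal curvature term accumulates drift into it: theta[1] grows roughly quadratically in time, so the quadratic alignment loss grows roughly quartically, and the fitted exponent approaches 4 in the small-step regime, matching the scaling the abstract describes.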