Efficient Refusal Ablation in LLM through Optimal Transport

2026-03-04 · Machine Learning · Artificial Intelligence
AI summary

The authors studied how language models refuse harmful requests by examining the internal activation patterns that encode refusal behavior. They improved on previous jailbreaking methods, which removed a single refusal direction in a simple way, by transforming the entire distribution of harmful-prompt activations to match that of harmless ones, using a mathematical tool called optimal transport. Tested on six different models, their approach bypassed safety guardrails more effectively without hurting the models' overall performance. They also found that targeting one or two specific layers worked better than modifying the whole network, suggesting that refusal signals are concentrated in certain parts of the model. This work shows that current safety methods may be vulnerable to such distributional attacks.

language models, refusal behaviors, activation-based jailbreaking, optimal transport, principal component analysis (PCA), distributional attacks, model activations, layer-selective intervention, model alignment, geometric structure
Authors
Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob
Abstract
Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
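To make the "PCA with closed-form Gaussian optimal transport" step concrete, the sketch below shows what such a transform could look like: activations are projected onto a low-dimensional PCA subspace, and the closed-form Monge map between two Gaussians, T(z) = μ₂ + A(z − μ₁) with A = Σ₁^{-1/2}(Σ₁^{1/2} Σ₂ Σ₁^{1/2})^{1/2} Σ₁^{-1/2}, moves harmful-prompt activations toward the harmless distribution. This is a minimal illustration of the general technique, not the authors' implementation; the function name, PCA dimension, and regularization constant are assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm


def gaussian_ot_map(X_harm, X_safe, n_components=8):
    """Build a map sending harmful activations toward the harmless distribution.

    X_harm, X_safe: (n, d) arrays of activations for harmful/harmless prompts.
    Returns a function that transports activations in a PCA subspace via the
    closed-form optimal transport map between two fitted Gaussians.
    """
    # PCA basis from the pooled activations (top right-singular vectors).
    pooled = np.vstack([X_harm, X_safe])
    mean = pooled.mean(axis=0)
    _, _, Vt = np.linalg.svd(pooled - mean, full_matrices=False)
    P = Vt[:n_components]                       # (k, d), orthonormal rows

    Zh = (X_harm - mean) @ P.T                  # harmful acts, PCA coords
    Zs = (X_safe - mean) @ P.T                  # harmless acts, PCA coords

    mu1, mu2 = Zh.mean(axis=0), Zs.mean(axis=0)
    eye = 1e-6 * np.eye(n_components)           # small ridge for stability
    S1 = np.cov(Zh, rowvar=False) + eye
    S2 = np.cov(Zs, rowvar=False) + eye

    # Closed-form Gaussian OT (Monge) map: T(z) = mu2 + A (z - mu1),
    # A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}.
    S1_half = np.real(sqrtm(S1))
    S1_inv_half = np.linalg.inv(S1_half)
    A = S1_inv_half @ np.real(sqrtm(S1_half @ S2 @ S1_half)) @ S1_inv_half

    def transport(x):
        z = (x - mean) @ P.T
        z_new = mu2 + (z - mu1) @ A.T
        # Only the component inside the PCA subspace is edited; the
        # orthogonal residual of the activation is left untouched.
        return x + (z_new - z) @ P

    return transport
```

Because the map is linear and closed-form, it can be applied per token at inference time at one or two chosen layers (the layer-selective intervention described above) rather than retraining or projecting out a single direction.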