Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings

2026-06-02 · Computation and Language

AI summary

The authors study how to remove specific knowledge from language models, which matters for safety and compliance. They hypothesize that existing parameter-updating methods fall short in part because they overlook the embedding layer, where tokens are mapped to vectors. To address this, they introduce EMBER (EMBedding ERasure), a plug-in module that uses Sparse Matrix Factorization to erase concept-related features directly from token embeddings. On Gemma-2-2B-it and Llama-3.1-8B-Instruct, augmenting existing methods with EMBER improves erasure efficacy and specificity and makes the erasure more robust to relearning, with minimal coherence loss. The results suggest that precise embedding-level intervention is needed for robust concept erasure.

language models, knowledge erasure, embedding layer, Sparse Matrix Factorization, token embeddings, model parameters, adversarial prompting, relearning, coherence loss, concept erasure
Authors
Clara Haya Suslik, Or Shafran, Mor Geva
Abstract
As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge often can be recovered through adversarial prompting or relearning. In this work, we hypothesize this limitation stems in part from existing methods overlooking the embedding layer. To address this, we introduce EMBedding ERasure (EMBER), a plug-n-play erasure module that leverages Sparse Matrix Factorization for precise erasure of concept-related features from token embeddings. Through comprehensive evaluations across diverse concepts on Gemma-2-2B-it and Llama-3.1-8B-Instruct, we find that augmenting existing methods with EMBER consistently improves erasure efficacy and specificity across task formats, with minimal coherence loss. Moreover, it dramatically improves robustness to relearning, reducing regained accuracy by up to 50%, limiting it to 35% on Llama compared to 70%-76% for prior methods. Further analysis shows that the coherence cost is localized, affecting only a small set of concept-exclusive tokens. Our work establishes that precise embedding-level intervention is necessary for robust concept erasure, and demonstrates that existing methods can benefit from such augmentation.
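
The abstract does not spell out how the factorization is used, so the following is only a minimal sketch of one way feature-level erasure from a token embedding could look. It assumes a sparse factorization is already available, with each token embedding approximated by a few active rows of a feature dictionary, and that the concept-related feature indices are already known. The names `erase_features`, `TOY_DICTIONARY`, `TOY_CODES`, and `TOY_EMBEDDINGS` are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set

Vector = List[float]

# Hypothetical toy factorization: each row is one feature direction.
TOY_DICTIONARY: List[Vector] = [
    [1.0, 0.0, 0.0],  # feature 0: assumed to be concept-related
    [0.0, 1.0, 0.0],  # feature 1: unrelated
    [0.0, 0.0, 1.0],  # feature 2: unrelated
]

# Sparse codes per token: feature index -> activation (most are absent).
TOY_CODES: Dict[str, Dict[int, float]] = {
    "paris": {0: 2.0, 2: 1.0},
    "table": {1: 3.0},
}

# Token embeddings; "paris" also carries a residual of 0.5 on the second
# dimension that the sparse codes do not explain.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "paris": [2.0, 0.5, 1.0],
    "table": [0.0, 3.0, 0.0],
}


def erase_features(
    embedding: Vector,
    codes: Dict[int, float],
    dictionary: List[Vector],
    concept_features: Set[int],
) -> Vector:
    """Return a copy of `embedding` with concept feature contributions removed.

    Only the contribution activation * direction of each concept feature is
    subtracted, so unrelated features and any unexplained residual are kept.
    """
    edited = list(embedding)  # do not mutate the caller's vector
    for feature, activation in codes.items():
        if feature in concept_features:
            for i, component in enumerate(dictionary[feature]):
                edited[i] -= activation * component
    return edited


if __name__ == "__main__":
    for token, vector in TOY_EMBEDDINGS.items():
        print(token, erase_features(vector, TOY_CODES[token], TOY_DICTIONARY, {0}))
```

In this toy setup only "paris" activates the concept feature, so only its embedding changes, while "table" is returned unchanged. This loosely mirrors the abstract's observation that the coherence cost is localized to a small set of concept-exclusive tokens, though the paper's actual procedure may differ.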