AI summary
The authors study how large language models (LLMs) can memorize private information from their training data and why that information needs to be removed reliably after training. They created LACUNA, a testbed that injects the personal information of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models, making it possible to check whether unlearning methods truly erase that information from the weights or merely hide it. Using LACUNA, they found that current unlearning methods, despite performing well at the output level, are highly imprecise and remain vulnerable to resurfacing attacks that recover the supposedly removed information. They also found that when the responsible parameters are located successfully, even a simple gradient-based method erases the information and resists these attacks. They release LACUNA to help researchers improve localization-based unlearning.
large language models, unlearning, personally identifiable information, parameter localization, masked continual pretraining, resurfacing attacks, model weights, OLMo models, gradient-based unlearning
Authors
Matteo Boglioni, Thibault Rousset, Siva Reddy, Marius Mosbach, Verna Dankers
Abstract
LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.
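The idea of confining updates to predefined parameters, and then scoring a localization method against that ground truth, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that "masked" means zeroing the gradient outside a fixed binary mask, and the names `masked_sgd_step` and `localization_precision_recall` are hypothetical helpers introduced only for this example. It uses plain Python lists in place of model tensors.

```python
from typing import List, Tuple


def masked_sgd_step(weights: List[float], grads: List[float],
                    mask: List[int], lr: float) -> List[float]:
    """Apply one SGD step only where mask == 1 (assumed masking scheme).

    Weights with mask == 0 receive a zero update and stay unchanged.
    """
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


def localization_precision_recall(predicted: List[int],
                                  ground_truth: List[int]) -> Tuple[float, float]:
    """Compare a predicted parameter mask with the ground-truth mask."""
    true_positives = sum(1 for p, t in zip(predicted, ground_truth) if p and t)
    predicted_count = sum(1 for p in predicted if p)
    ground_truth_count = sum(1 for t in ground_truth if t)
    precision = true_positives / predicted_count if predicted_count else 0.0
    recall = true_positives / ground_truth_count if ground_truth_count else 0.0
    return precision, recall


# Toy "model" with four parameters; only indices 1 and 2 may store the injected data.
weights = [0.5, -1.0, 2.0, 0.0]
grads = [0.1, 0.2, -0.3, 0.4]
injection_mask = [0, 1, 1, 0]  # ground-truth location (hypothetical)

updated = masked_sgd_step(weights, grads, injection_mask, lr=0.5)

# A hypothetical localization method that selects indices 0 and 1.
predicted_mask = [1, 1, 0, 0]
precision, recall = localization_precision_recall(predicted_mask, injection_mask)
print(precision, recall)  # 0.5 0.5
```

Under the same assumption, a localized gradient-based unlearning step could reuse `masked_sgd_step` with the sign of the gradient flipped (gradient ascent on the data to forget), restricted to whichever mask the localization method predicts.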