PURGE: Projected Unlearning via Retain-Guided Erasure

2026-06-02, Machine Learning

Machine Learning, Artificial Intelligence, Cryptography and Security
AI summary

The authors present PURGE, a method to help a machine learning model "forget" specific data without losing performance on what it should remember. They treat forgetting and continual learning as mirror-image problems and use gradient projection to keep the model's accuracy on retained data steady while erasing targeted information inside its layers. Instead of forcing the model to output a uniform distribution for forgotten data, they make it behave as if it were naturally uncertain, making it harder to tell what was removed. They also built two automatic stopping rules that end the process at the right time, avoiding manual epoch tuning. Their tests show PURGE works well on five image datasets, maintaining accuracy and strong privacy protection.

machine unlearning, continual learning, gradient projection, A-GEM, membership inference attack, representation erasure, retain accuracy, forget-set, privacy-utility tradeoff, stopping criteria
Authors
Vedant Jawandhia, Daksh Ahuja, Ghufran Alam Siddiqui, Prashant Trivedi, Yash Sinha, Pratik Narang
Abstract
We propose PURGE, a machine unlearning algorithm built on a simple but under-exploited observation: continual learning (CL) and machine unlearning (MU) are fundamentally dual problems. CL tries to learn new tasks without forgetting old ones; MU tries to erase specific data without hurting retained performance. Both face the same underlying tension, approached from opposite directions. PURGE leverages this duality by adapting gradient projection from A-GEM (Chaudhry et al., 2019) so that every unlearning step is constrained not to increase the retain-set loss. On top of this, it performs multi-layer representation erasure, pushing forget-set activations in intermediate layers towards the retain distribution to remove information from hidden representations rather than just suppressing it at the output. A key design choice is the retain-confusion target: rather than pushing forget outputs toward the uniform distribution, which we found to be surprisingly easy for membership inference attacks (MIA) to detect, we instead target the model's natural confusion pattern on retain data. This makes the unlearned model hard to distinguish from one retrained from scratch. Two self-regulating stopping criteria (a retain-loss budget and a forget-accuracy target) let the algorithm decide on its own when to stop, removing the need for manual epoch tuning. In experiments on five datasets (CIFAR-10, MNIST, SVHN, STL10, PathMNIST) across 22 class-level forgetting tasks, PURGE consistently keeps retain accuracy above 96% while achieving MIA AUROC close to 0.5 (the ideal), outperforming gradient ascent, KL-uniform, and several published baselines on the privacy-utility frontier.
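As a rough illustration of the two mechanisms named in the abstract, the sketch below shows an A-GEM-style projection of an unlearning gradient against a retain-set gradient, together with one possible reading of the two stopping criteria. It is a minimal sketch and not the authors' implementation: gradients are flattened into plain Python lists, and the names `project_agem`, `should_stop`, `loss_budget`, and `forget_acc_target` are hypothetical. The exact form of the stopping rules (an additive budget over a baseline retain loss, and a threshold on forget accuracy) is an assumption, since the abstract does not specify them.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_agem(g_forget: Sequence[float],
                 g_retain: Sequence[float],
                 eps: float = 1e-12) -> List[float]:
    """A-GEM-style projection (hypothetical helper).

    g_forget: gradient of the unlearning objective (the step is -lr * g).
    g_retain: gradient of the retain-set loss at the same parameters.
    If the two gradients conflict (negative inner product), the component
    of g_forget along g_retain is removed, so that to first order the
    step no longer increases the retain loss.
    """
    inner = dot(g_forget, g_retain)
    if inner >= 0.0:
        return list(g_forget)      # no conflict: keep the gradient as is
    ref_sq = dot(g_retain, g_retain)
    if ref_sq <= eps:
        return list(g_forget)      # degenerate retain gradient: nothing to project on
    scale = inner / ref_sq
    return [gf - scale * gr for gf, gr in zip(g_forget, g_retain)]


def should_stop(retain_loss: float,
                baseline_retain_loss: float,
                forget_acc: float,
                loss_budget: float,
                forget_acc_target: float) -> bool:
    """Assumed form of the two self-regulating stopping criteria.

    Stop when the retain loss exceeds its baseline by more than the
    budget, or when forget-set accuracy has dropped to the target.
    """
    over_budget = retain_loss > baseline_retain_loss + loss_budget
    reached_target = forget_acc <= forget_acc_target
    return over_budget or reached_target


if __name__ == "__main__":
    g_u = [1.0, -2.0]   # toy unlearning gradient
    g_r = [0.0, 1.0]    # toy retain gradient (conflicts with g_u)
    print(project_agem(g_u, g_r))
    print(should_stop(0.30, 0.25, 0.50, loss_budget=0.10, forget_acc_target=0.10))
```

In a real training loop both gradients would come from backpropagation over a forget batch and a retain batch, and the projected vector would be written back into the parameter gradients before the optimizer step. The projection only guarantees non-increase of the retain loss to first order, which is one reason an explicit retain-loss budget is a natural companion check.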