From Efficiency to Leakage -- Privacy Backdoor in Federated Language Model Fine-Tuning

2026-06-18 · Cryptography and Security

AI summary

The authors show that in federated learning, where multiple parties collaboratively fine-tune a language model without sharing raw data, a malicious parameter server can stealthily turn a parameter-efficient fine-tuning (PEFT) adapter into a privacy backdoor. Their attack, NeuroImprint, assigns each training sample a dedicated memorization neuron that is updated at most once during local fine-tuning, so the stored per-sample update can later be inverted to recover the text without degrading model utility. They evaluated it on four language models (BERT, GPT-2, Qwen2, and Llama3.2) and four datasets, and report reconstructing 59% to 79% of the fine-tuning samples with high semantic fidelity. The result implies that PEFT, which trains only a small set of adapters on top of a frozen base model, can still leak sensitive training data when the server is malicious.

Federated Learning, Parameter-Efficient Fine-Tuning, Privacy Backdoor, Language Models, Neurons, Data Memorization, Model Inversion, Adam Optimizer, BERT, GPT-2
Authors
Shanghao Shi, Chaoyu Zhang, Heng Jin, Yang Xiao, Yevgeniy Vorobeychik, William Yeoh, Ning Zhang, Y. Thomas Hou, Wenjing Lou
Abstract
Federated learning (FL) enables multiple parties to collaboratively fine-tune language models for domain-specific tasks without sharing raw data. Since full model fine-tuning is often prohibitively expensive for FL clients, parameter-efficient fine-tuning (PEFT) has become the de facto approach in practice, freezing the base model and training only a small set of adapters. In this paper, we show that a malicious parameter server can stealthily corrupt a PEFT adapter into a privacy backdoor that implicitly memorizes the client's training samples as isolated per-sample parameter updates stored in separate neurons, without degrading model utility. Concretely, our attack, NeuroImprint, assigns a dedicated memorization neuron to each training sample and constrains each neuron to be updated at most once along the local fine-tuning trajectory. This design mitigates both cross-sample collisions and cross-step mixing introduced by large local batches and stateful optimizers (e.g., Adam/AdamW) in language-model fine-tuning. After fine-tuning, the resulting isolated per-sample updates can be analytically inverted in closed form to recover text embeddings, which are then deterministically mapped back to token sequences. To understand the generality of our method, we implemented NeuroImprint on multiple language models (BERT, GPT-2, Qwen2, and Llama3.2) and evaluated it across four fine-tuning datasets spanning diverse domains. The results demonstrate that our attack can reconstruct 59% to 79% of all fine-tuning samples with high semantic fidelity.
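
To make the "isolated per-sample update, inverted in closed form" idea concrete, the toy sketch below uses a single linear neuron z = w·x + b trained with one plain SGD step. This is not the NeuroImprint implementation: the neuron shape, the SGD optimizer, the toy embedding table, and the nearest-neighbor token lookup are all assumptions chosen for illustration. It relies only on the standard identity that the weight gradient of such a neuron is the bias gradient times the input, so the ratio of the weight update to the bias update returns the input.

```python
# Toy illustration only: one linear neuron, one SGD step, one sample.
# All names and values here are hypothetical, not taken from the paper.

# Hypothetical 3-dimensional embedding table for three tokens.
EMBEDDINGS = {
    "public": [1.0, 0.0, 0.5],
    "secret": [-0.5, 2.0, 1.0],
    "token": [0.0, -1.0, 3.0],
}


def sgd_step_on_neuron(w, b, x, upstream_grad, lr):
    """Apply one SGD step to a neuron z = w.x + b.

    With g = dL/dz (upstream_grad): dL/dw = g * x and dL/db = g.
    """
    new_w = [wi - lr * upstream_grad * xi for wi, xi in zip(w, x)]
    new_b = b - lr * upstream_grad
    return new_w, new_b


def invert_update(w_before, b_before, w_after, b_after):
    """Recover x from a single-sample update as delta_w / delta_b."""
    delta_b = b_after - b_before
    if delta_b == 0:
        raise ValueError("bias did not change; the neuron saw no gradient")
    return [(wa - wb) / delta_b for wa, wb in zip(w_after, w_before)]


def nearest_token(vec, table):
    """Map a recovered vector to the closest embedding (squared L2 distance)."""
    def dist(tok):
        return sum((a - b) ** 2 for a, b in zip(vec, table[tok]))
    return min(table, key=dist)


# The client trains on the embedding of "secret"; the observer sees only
# the neuron's parameters before and after the step.
w0, b0 = [0.1, -0.2, 0.3], 0.05
w1, b1 = sgd_step_on_neuron(w0, b0, EMBEDDINGS["secret"], upstream_grad=0.7, lr=0.01)

recovered = invert_update(w0, b0, w1, b1)
token = nearest_token(recovered, EMBEDDINGS)
print(recovered, token)
```

The ratio works here only because exactly one sample touched the neuron exactly once; if several samples or several steps contribute, their updates mix and the quotient no longer equals any single input. Adam-style optimizers additionally rescale each coordinate of the update, so this simple ratio does not carry over directly; the abstract states that NeuroImprint is designed to handle that setting, but the details are not given in this section.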