AI summary
The authors study how to generate and identify languages (sets of strings) while keeping the original data private, building on a recent model by Kleinberg and Mullainathan. They show that for countable language collections, privacy does not prevent eventually generating valid strings, although in some cases it requires many more samples. However, when trying to identify which language a stream belongs to, privacy creates serious limitations, especially if the languages overlap in complex ways. In scenarios where data is randomly sampled, private identification is possible exactly when identification is possible without privacy. The authors highlight new differences between generating and identifying languages under privacy constraints, and between adversarial and random data models.
Keywords: language generation, language identification, differential privacy, continual release model, countable collections, adversarial model, stochastic model, sample complexity, infinite intersection, private learning
Authors
Anay Mehrotra, Grigoris Velegkas, Xifan Yu, Felix Zhou
Abstract
We initiate the study of language generation in the limit, a model recently introduced by Kleinberg and Mullainathan [KM24], under the constraint of differential privacy. We consider the continual release model, where a generator must eventually output a stream of valid strings while protecting the privacy of the entire input sequence. Our first main result is that for countable collections of languages, privacy comes at no qualitative cost: we provide an $\varepsilon$-differentially-private algorithm that generates in the limit from any countable collection. This stands in contrast to many learning settings where privacy renders learnability impossible. However, privacy does impose a quantitative cost: there are finite collections of size $k$ for which uniform private generation requires $\Omega(k/\varepsilon)$ samples, whereas just one sample suffices non-privately. We then turn to the harder problem of language identification in the limit. Here, we show that privacy creates fundamental barriers. We prove that no $\varepsilon$-DP algorithm can identify a collection containing two languages with an infinite intersection and a finite set difference, a condition far stronger than the classical non-private characterization of identification. Next, we turn to the stochastic setting where the sample strings are sampled i.i.d. from a distribution (instead of being generated by an adversary). Here, we show that private identification is possible if and only if the collection is identifiable in the adversarial model. Together, our results establish new dimensions along which generation and identification differ and, for identification, a separation between adversarial and stochastic settings induced by privacy constraints.
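To make the non-private baseline concrete, here is a minimal toy sketch (not from the paper) of Gold-style identification in the limit: the learner fixes an enumeration of a candidate collection and, after each observed string, guesses the first enumerated language consistent with everything seen so far. The collection, stream, and function name are illustrative assumptions; real instances involve infinite languages and adversarial enumerations of the target.

```python
# Toy sketch of (non-private) identification in the limit, assuming a
# finite collection of finite languages given as Python sets. The paper's
# setting is far more general; this only illustrates the guessing rule.

def identify_in_the_limit(collection, stream):
    """Yield the learner's guess (an index into `collection`) after each string."""
    seen = set()
    for s in stream:
        seen.add(s)
        # Guess the first enumerated language containing all observations so far.
        yield next(i for i, lang in enumerate(collection) if seen <= lang)

# Hypothetical nested collection; the target language is {"a", "b"}.
collection = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
stream = ["a", "b", "a", "b"]  # an enumeration of the target
print(list(identify_in_the_limit(collection, stream)))  # → [0, 1, 1, 1]
```

After finitely many strings the guess stabilizes on the target's index, which is the sense of "in the limit"; the abstract's point is that an $\varepsilon$-DP learner cannot mimic this rule whenever two candidate languages share an infinite intersection but differ on only finitely many strings.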