Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.314
ER-AE: Differentially Private Text Generation for Authorship Anonymization

Abstract: Most privacy protection studies for textual data focus on removing explicit sensitive identifiers. However, personal writing style, a strong indicator of authorship, is often neglected. Recent studies, such as SynTF, have shown promising results on privacy-preserving text mining. However, their anonymization algorithm can only output numeric term vectors, which are difficult for recipients to interpret. We propose a novel text generation model with a two-set exponential mechanism for authorship an…

Cited by 10 publications (11 citation statements) · References 20 publications
“…Section 2.1) is a notion of privacy based on randomness, i.e., any non-trivial DP mechanism must be non-deterministic. In the context of textual data, existing DP mechanisms typically operate on a word-by-word basis [6, 23–25, 63] by randomly determining each word in the output sequence. The common ground of these methods is to impose a probability distribution, locally at each position, over the words in the output space (or similarly, over the embedding space) and draw a word (or corresponding embedding) from that distribution.…”
Section: Differentially Private Inference
confidence: 99%
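The position-wise sampling described above is typically instantiated with the exponential mechanism: each candidate output word gets a utility score, and a word is drawn with probability proportional to exp(ε·u/(2Δu)). A minimal sketch, with hypothetical utility scores and no claim to match any specific cited system:

```python
import math
import random

def exponential_mechanism_word(scores, epsilon, sensitivity=1.0):
    """Draw one output word via the exponential mechanism.

    `scores` maps candidate words to utility values (higher = more
    faithful to the original); `epsilon` is the per-draw privacy budget.
    """
    words = list(scores)
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores.values())
    weights = [math.exp(epsilon * (scores[w] - m) / (2 * sensitivity))
               for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Toy scores for a single output position (illustrative values only).
scores = {"good": 3.0, "great": 2.5, "fine": 1.0}
word = exponential_mechanism_word(scores, epsilon=2.0)
```

Lower ε flattens the distribution (more privacy, less utility); higher ε concentrates mass on the highest-scoring word.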
“…Weggenmann and Kerschbaum [63] and Fernandes et al. [23] approach anonymization by applying DP mechanisms to document term-frequency vectors, obtaining differentially private text representations that are, however, not human-readable. Other methods relying on DP mechanisms apply them at the word level by perturbing words or their embeddings, thus exchanging individual tokens in a given text [6, 24, 25]. Deep-learning-based approaches for authorship obfuscation include back-translation [37, 64] as well as the incorporation of GANs as discriminators [55].…”
Section: Related Work
confidence: 99%
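The term-frequency-vector route mentioned above can be sketched generically as adding Laplace noise to each component of the vector. This is only an illustration of the idea of a non-human-readable DP representation, not the exact mechanisms of the cited papers:

```python
import math
import random

def dp_term_vector(tf, epsilon, sensitivity=1.0):
    """Release a term-frequency vector under the Laplace mechanism.

    Generic sketch: each count gets i.i.d. Laplace(0, sensitivity/epsilon)
    noise, yielding a DP vector that machines, not humans, consume.
    """
    scale = sensitivity / epsilon

    def laplace():
        # Inverse-CDF sampling of Laplace(0, scale) from u in (-0.5, 0.5).
        u = random.random() - 0.5
        return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

    return [v + laplace() for v in tf]

# Hypothetical 3-term document vector.
noisy = dp_term_vector([1.0, 0.0, 2.0], epsilon=1.0)
```

The noisy vector can still feed downstream topic or sentiment models, which is exactly the trade-off the snippet describes: utility for automated processing, but no readable text.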
“…While not human-readable, these vector representations can be shared for automated processing, such as topic or sentiment inference and machine learning. To generate human-readable text, Bo et al. (2021) employ an encoder-decoder model similar to ours, but without paraphrasing, and sample output words using (a two-set variant of) the exponential mechanism (McSherry and Talwar, 2007). Weggenmann et al. (2022) propose a differentially private variation of the variational autoencoder and use it as a sequence-to-sequence architecture for text anonymization.…”
Section: Related Work
confidence: 99%
“…Previous work in the field of authorship obfuscation mainly focuses on two different tasks: learning anonymous textual vector representations for downstream tasks (Coavoux et al., 2018a; Weggenmann and Kerschbaum, 2018; Fernandes et al., 2019; Mosallanezhad et al., 2019; Beigi et al., 2019) and developing mechanisms that transform the input sentence to remove author-revealing properties and thus output human-readable text. Works in the second category (Feyisetan et al., 2019, 2020; Xu et al., 2020b; Bo et al., 2021) typically follow a common word-level framework, characterized by the differentially private perturbation of individual word embeddings and the subsequent sampling of new words that are close to the perturbed vectors in the embedding space. Also, the majority of recent work proposing new methods for authorship obfuscation deals with the optimization and calibration of noise-sampling mechanisms (Xu et al., 2020a) or the definition of new distributions to sample noise from (Feyisetan et al., 2019), as opposed to the development of entirely new methods.…”
Section: Introduction
confidence: 99%
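The word-level framework in that snippet — perturb a word's embedding, then emit the nearest vocabulary word — can be sketched as follows. The Gaussian noise and the toy embeddings here are illustrative assumptions, not the calibrated metric-DP noise of the cited papers:

```python
import math
import random

def perturb_and_swap(word, embeddings, epsilon):
    """Perturb a word's embedding and return the closest vocabulary word.

    `embeddings` maps words to equal-length vectors. Noise scale shrinks
    as epsilon grows (less noise = less privacy, more fidelity).
    """
    vec = embeddings[word]
    scale = 1.0 / epsilon  # illustrative calibration, not a DP proof
    noisy = [v + random.gauss(0.0, scale) for v in vec]
    # Nearest-neighbor decoding back into the vocabulary.
    return min(embeddings, key=lambda w: math.dist(noisy, embeddings[w]))

# Hypothetical 2-D embeddings for a two-word vocabulary.
emb = {"good": [0.0, 0.0], "terrible": [10.0, 10.0]}
swapped = perturb_and_swap("good", emb, epsilon=1000.0)
```

With large ε the noisy point stays near the original embedding and the word maps back to itself; with small ε it can land nearer a different word, which is the token-exchange effect the snippet describes.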