Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.314
ER-AE: Differentially Private Text Generation for Authorship Anonymization

Abstract: Most privacy protection studies for textual data focus on removing explicit sensitive identifiers. However, personal writing style, a strong indicator of authorship, is often neglected. Recent studies, such as SynTF, have shown promising results on privacy-preserving text mining. However, their anonymization algorithm can only output numeric term vectors, which are difficult for recipients to interpret. We propose a novel text generation model with a two-set exponential mechanism for authorship an…

Cited by 10 publications (11 citation statements) · References 20 publications
“…Section 2.1) is a notion of privacy based on randomness, i.e., any non-trivial DP mechanism must be non-deterministic. In the context of textual data, existing DP mechanisms typically operate on a word-by-word basis [6, 23–25, 63] by randomly determining each word in the output sequence. The common ground of these methods is to impose a probability distribution, locally at each position, over the words in the output space (or similarly, over the embedding space) and draw a word (or corresponding embedding) from that distribution.…”
Section: Differentially Private Inference
confidence: 99%
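The position-wise sampling described above is typically instantiated with the exponential mechanism: each candidate output word gets a utility score, and a word is drawn with probability proportional to exp(ε·u/(2Δu)). A minimal sketch, with hypothetical utility scores and no claim to match any specific cited system:

```python
import math
import random

def exponential_mechanism_word(scores, epsilon, sensitivity=1.0):
    """Draw one output word via the exponential mechanism.

    `scores` maps candidate words to utility values (higher = more
    faithful to the original); `epsilon` is the per-draw privacy budget.
    """
    words = list(scores)
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores.values())
    weights = [math.exp(epsilon * (scores[w] - m) / (2 * sensitivity))
               for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Toy scores for a single output position (illustrative values only).
scores = {"good": 3.0, "great": 2.5, "fine": 1.0}
word = exponential_mechanism_word(scores, epsilon=2.0)
```

Lower ε flattens the distribution (more privacy, less utility); higher ε concentrates mass on the highest-scoring word.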
“…Weggenmann and Kerschbaum [63] and Fernandes et al. [23] approach anonymization by applying DP mechanisms to document term-frequency vectors, obtaining differentially private text representations that are, however, not human-readable. Other methods relying on DP mechanisms apply them at the word level by perturbing words or their embeddings, thus exchanging individual tokens in a given text [6, 24, 25]. Deep-learning-based approaches for authorship obfuscation include back-translation [37, 64] as well as the incorporation of GANs as discriminators [55].…”
Section: Related Work
confidence: 99%
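The term-frequency-vector route mentioned above can be sketched generically as adding Laplace noise to each component of the vector. This is only an illustration of the idea of a non-human-readable DP representation, not the exact mechanisms of the cited papers:

```python
import math
import random

def dp_term_vector(tf, epsilon, sensitivity=1.0):
    """Release a term-frequency vector under the Laplace mechanism.

    Generic sketch: each count gets i.i.d. Laplace(0, sensitivity/epsilon)
    noise, yielding a DP vector that machines, not humans, consume.
    """
    scale = sensitivity / epsilon

    def laplace():
        # Inverse-CDF sampling of Laplace(0, scale) from u in (-0.5, 0.5).
        u = random.random() - 0.5
        return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

    return [v + laplace() for v in tf]

# Hypothetical 3-term document vector.
noisy = dp_term_vector([1.0, 0.0, 2.0], epsilon=1.0)
```

The noisy vector can still feed downstream topic or sentiment models, which is exactly the trade-off the snippet describes: utility for automated processing, but no readable text.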
“…While not human-readable, these vector representations can be shared for automated processing, such as topic or sentiment inference and machine learning. To generate human-readable text, Bo et al. (2021) employ an encoder-decoder model similar to ours, but without paraphrasing, and sample output words using (a two-set variant of) the exponential mechanism (McSherry and Talwar, 2007). Weggenmann et al. (2022) propose a differentially private variation of the variational autoencoder and use it as a sequence-to-sequence architecture for text anonymization.…”
Section: Related Work
confidence: 99%
“…Previous work in the field of authorship obfuscation mainly focuses on two different tasks: learning anonymous textual vector representations for downstream tasks (Coavoux et al., 2018a; Weggenmann and Kerschbaum, 2018; Fernandes et al., 2019; Mosallanezhad et al., 2019; Beigi et al., 2019) and developing mechanisms that transform the input sentence to remove author-revealing properties and thus output human-readable text. Works in the second category (Feyisetan et al., 2019, 2020; Xu et al., 2020b; Bo et al., 2021) typically follow a common word-level framework, characterized by the differentially private perturbation of individual word embeddings and the subsequent sampling of new words that are close to the perturbed vectors in the embedding space. Also, the majority of recent work proposing new methods for authorship obfuscation deals with the optimization and calibration of noise-sampling mechanisms (Xu et al., 2020a) or the definition of new distributions to sample noise from (Feyisetan et al., 2019), as opposed to the development of entirely new methods.…”
Section: Introduction
confidence: 99%
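The word-level framework in that snippet — perturb a word's embedding, then emit the nearest vocabulary word — can be sketched as follows. The Gaussian noise and the toy embeddings here are illustrative assumptions, not the calibrated metric-DP noise of the cited papers:

```python
import math
import random

def perturb_and_swap(word, embeddings, epsilon):
    """Perturb a word's embedding and return the closest vocabulary word.

    `embeddings` maps words to equal-length vectors. Noise scale shrinks
    as epsilon grows (less noise = less privacy, more fidelity).
    """
    vec = embeddings[word]
    scale = 1.0 / epsilon  # illustrative calibration, not a DP proof
    noisy = [v + random.gauss(0.0, scale) for v in vec]
    # Nearest-neighbor decoding back into the vocabulary.
    return min(embeddings, key=lambda w: math.dist(noisy, embeddings[w]))

# Hypothetical 2-D embeddings for a two-word vocabulary.
emb = {"good": [0.0, 0.0], "terrible": [10.0, 10.0]}
swapped = perturb_and_swap("good", emb, epsilon=1000.0)
```

With large ε the noisy point stays near the original embedding and the word maps back to itself; with small ε it can land nearer a different word, which is the token-exchange effect the snippet describes.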