Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1316

Generating Natural Language Adversarial Examples

Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the model to misclassify. In the image domain, these perturbations are often virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we…
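In the full paper, the attack sketched in this abstract is a black-box, population-based (genetic) search over word substitutions. The snippet below is a rough, self-contained illustration of that idea only: the synonym table, the toy victim classifier, and all hyperparameters are hypothetical placeholders, not the authors' actual setup (which uses GloVe nearest neighbors, a language-model filter, and trained sentiment/entailment models as victims).

```python
import random

# Everything below is a hypothetical, self-contained illustration; the paper's
# real attack uses GloVe nearest neighbors, a language-model filter, and a
# trained sentiment/entailment model as the victim.
SYNONYMS = {
    "terrible": ["horrible", "awful"],
    "boring": ["dull", "tedious"],
    "movie": ["film"],
}

def victim_negative_prob(words):
    """Toy black-box classifier: probability that the text is 'negative'."""
    cues = {"terrible", "boring"}          # the toy model only keys on these exact words
    return min(1.0, 0.45 * sum(w in cues for w in words))

def mutate(words):
    """Replace one randomly chosen word with an allowed substitute."""
    positions = [i for i, w in enumerate(words) if w in SYNONYMS]
    if not positions:
        return list(words)
    i = random.choice(positions)
    out = list(words)
    out[i] = random.choice(SYNONYMS[words[i]])
    return out

def genetic_attack(words, pop_size=8, generations=20):
    """Population-based search for substitutions that flip the prediction."""
    population = [mutate(words) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [victim_negative_prob(c) for c in population]   # lower = closer to a flip
        best = population[scores.index(min(scores))]
        if victim_negative_prob(best) < 0.5:
            return best                                          # no longer classified 'negative'
        # sample parents in proportion to how much they weaken the prediction, then mutate
        parents = random.choices(population, weights=[1.0 - s + 1e-6 for s in scores], k=pop_size)
        population = [best] + [mutate(p) for p in parents[:pop_size - 1]]
    return None

print(genetic_attack("the movie was terrible and boring".split()))
```

Because the search only queries the victim for output probabilities, it needs no gradients, which is the sense in which the attack is black-box.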

Cited by 668 publications (770 citation statements). References 32 publications.

Citation statements:
“…Word substitution perturbations. We base our sets of allowed word substitutions S(x, i) on the substitutions allowed by Alzantot et al. (2018). They demonstrated that their substitutions lead to adversarial examples that are qualitatively similar to the original input and retain the original label, as judged by humans.…”
Section: Setup
confidence: 99%
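As a minimal sketch of what a substitution set S(x, i) can look like, the snippet below builds one from embedding nearest neighbors. The toy vectors, the cosine threshold, and the neighbor cap are assumptions for illustration; the cited work derives its sets from (counter-fitted) GloVe vectors rather than a hand-written table.

```python
import numpy as np

# Hypothetical toy embeddings; the cited work builds its substitution sets
# from (counter-fitted) GloVe vectors, which are not reproduced here.
EMB = {
    "good":  np.array([0.90, 0.10]),
    "great": np.array([0.88, 0.15]),
    "fine":  np.array([0.80, 0.20]),
    "bad":   np.array([-0.90, 0.10]),
}

def substitution_set(x, i, max_neighbors=8, min_cosine=0.9):
    """S(x, i): words whose embedding lies close enough to word i of sentence x."""
    word = x[i]
    if word not in EMB:
        return []
    v = EMB[word]
    scored = []
    for cand, u in EMB.items():
        if cand == word:
            continue
        cos = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
        if cos >= min_cosine:
            scored.append((cos, cand))
    return [cand for _, cand in sorted(scored, reverse=True)[:max_neighbors]]

print(substitution_set("a good movie".split(), 1))   # ['great', 'fine'] with these toy vectors
```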
“…We make three modifications to this approach. First, in Alzantot et al. (2018), the adversary applies substitutions one at a time, and the neighborhoods and language model scores are computed relative to the current altered version of the input. This results in a hard-to-define attack surface, as changing one word can allow or disallow changes to other words.…”
Section: Setup
confidence: 99%
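To make the shifting attack surface described above concrete, the sketch below applies substitutions one at a time and recomputes both the candidate neighborhood and a toy language-model score against the current, already-modified sentence, so an early swap can enable or block later ones. The NEIGHBORS table, toy_lm_score, and the flips_prediction callback are hypothetical stand-ins, not the original implementation.

```python
# Illustrative only: the neighborhood and the (toy) language-model score are
# recomputed against the current, already-perturbed sentence at every step,
# so earlier substitutions change which later substitutions are admissible.
NEIGHBORS = {"terrible": ["awful"], "awful": ["dreadful"], "boring": ["dull"]}

def toy_lm_score(words, i, cand):
    """Placeholder fluency check: pretend bigram score around position i."""
    left = words[i - 1] if i > 0 else "<s>"
    disfluent = {("very", "dreadful")}            # hypothetical banned bigram
    return 0.0 if (left, cand) in disfluent else 1.0

def sequential_attack(words, flips_prediction, lm_threshold=0.5):
    current = list(words)
    for i in range(len(current)):
        for cand in NEIGHBORS.get(current[i], []):
            if toy_lm_score(current, i, cand) >= lm_threshold:
                current = current[:i] + [cand] + current[i + 1:]  # state shifts here
                if flips_prediction(current):
                    return current
                break
    return None

# Hypothetical victim: "flipped" once neither trigger word remains.
print(sequential_attack("a very terrible and boring movie".split(),
                        flips_prediction=lambda ws: "terrible" not in ws and "boring" not in ws))
```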
“…Adversarial training is the prevailing counter-measure for building a robust model (Goodfellow et al., 2015; Iyyer et al., 2018; Marzinotto et al., 2019; Cheng et al., 2019) by mixing adversarial examples with the original ones during model training. However, these adversarial examples can be detected and deactivated by a genetic algorithm (Alzantot et al., 2018). This method also requires retraining, which can be time- and cost-consuming for large-scale models.…”
Section: Related Work
confidence: 99%
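The adversarial-training recipe this excerpt refers to, augmenting each training example with an adversarially perturbed copy that keeps its label, can be sketched as follows. The bag-of-words logistic-regression model, the synonym rule, and the two-example dataset are stand-ins chosen so the snippet runs on its own; they are not the setup of any of the cited papers.

```python
import numpy as np

VOCAB = ["terrible", "awful", "boring", "dull", "great", "movie"]
SYN = {"terrible": "awful", "boring": "dull"}      # hypothetical substitution rule

def bow(words):
    """Bag-of-words vector over the toy vocabulary."""
    v = np.zeros(len(VOCAB))
    for w in words:
        if w in VOCAB:
            v[VOCAB.index(w)] += 1.0
    return v

def perturb(words):
    """Stand-in adversary: swap known words for substitutes the model may not have seen."""
    return [SYN.get(w, w) for w in words]

def train(data, epochs=200, lr=0.5):
    w = np.zeros(len(VOCAB))
    for _ in range(epochs):
        for words, y in data:
            # adversarial training: the original example AND its perturbed copy share a label
            for x in (bow(words), bow(perturb(words))):
                p = 1.0 / (1.0 + np.exp(-(w @ x)))
                w += lr * (y - p) * x              # logistic-regression gradient step
    return w

data = [("a terrible boring movie".split(), 1), ("a great movie".split(), 0)]
w = train(data)
print(w @ bow("an awful dull movie".split()) > 0)  # perturbed phrasing is still flagged: True
```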
“…In machine translation, attention learns to align foreign words with their native counterparts (Bahdanau et al., 2015). On the other hand, neural networks often do not behave as humans do (Szegedy et al., 2014; Jia and Liang, 2017; Ribeiro et al., 2018; Alzantot et al., 2018). Convolutional networks rely heavily on texture (Geirhos et al., 2019), while humans rely on shape (Landau et al., 1988).…”
Section: Human Evaluation Of Evidence
confidence: 99%