ICASSP 2019 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682430
Universal Adversarial Attacks on Text Classifiers

Abstract: Despite the vast success neural networks have achieved in different application domains, they have been proven to be vulnerable to adversarial perturbations (small changes in the input) that lead them to produce the wrong output. In this paper, we propose a novel method, based on gradient projection, for generating universal adversarial perturbations for text; namely, a sequence of words that can be added to any input in order to fool the classifier with high probability. We observed that text classifiers are q…
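The gradient-projection idea described in the abstract can be sketched as follows: update a trainable trigger sequence in embedding space by gradient ascent on the classifier's loss, then project each trigger embedding back onto the nearest real word embedding so the perturbation remains an actual word sequence. This is a minimal illustrative sketch, not the authors' implementation; `model` (assumed to be a PyTorch classifier operating on embedded inputs), `embedding_matrix`, and `batch_iterator` are assumptions introduced for illustration.

```python
import torch
import torch.nn.functional as F

def universal_trigger_search(model, embedding_matrix, batch_iterator,
                             trigger_len=3, steps=100, lr=1.0):
    """Gradient ascent in embedding space, projected back onto real words."""
    vocab_size, _ = embedding_matrix.shape
    # Initialise the trigger from random vocabulary embeddings.
    trigger = embedding_matrix[torch.randint(vocab_size, (trigger_len,))].clone()
    trigger.requires_grad_(True)

    for _ in range(steps):
        inputs, labels = next(batch_iterator)            # inputs: (B, T, D) embeddings
        batch_trigger = trigger.unsqueeze(0).expand(inputs.size(0), -1, -1)
        perturbed = torch.cat([batch_trigger, inputs], dim=1)  # prepend trigger words
        loss = F.cross_entropy(model(perturbed), labels)
        grad, = torch.autograd.grad(loss, trigger)

        with torch.no_grad():
            trigger += lr * grad                         # ascend the loss to fool the model
            # Projection step: snap each trigger embedding to its nearest
            # vocabulary embedding so the perturbation stays a word sequence.
            nearest = torch.cdist(trigger, embedding_matrix).argmin(dim=1)
            trigger.copy_(embedding_matrix[nearest])

    return trigger  # embeddings of the final universal word sequence
```

Because the same trigger is optimised over batches drawn from the whole data distribution, the resulting word sequence is input-agnostic: it is intended to degrade the classifier on any sample it is prepended to.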

Cited by 74 publications (35 citation statements) · References 14 publications
“…Besides efforts devoted to MRC systems, many efforts have also been devoted to adversarial attack methods on text. Behjati et al. (2019) tried to distract a text classifier by training perturbation embeddings. Iyyer et al. (2018) proposed a syntactically controlled paraphrase network to generate grammatical adversarial examples.…”
Section: Related Work
confidence: 99%
“…Similar to Behjati et al. (2019), Gong et al. (2018), and Sato et al. (2018), our perturbation adversarial training method aims to train a perturbation embedding sequence for each instance under the supervision of the target model so as to distract it.…”
Section: Perturbation Embedding Training
confidence: 99%
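The per-instance scheme this citing work describes can be sketched roughly as follows: freeze the target model and optimise only a short perturbation embedding sequence until that one instance is misclassified. This is a hedged sketch under assumptions, not the citing paper's code; `model`, `input_embeds` (shape `(1, T, D)`), and `label` are illustrative names.

```python
import torch
import torch.nn.functional as F

def train_instance_perturbation(model, input_embeds, label,
                                pert_len=3, emb_dim=300, steps=50, lr=0.1):
    # The target model only supervises the attack; its weights stay frozen.
    for p in model.parameters():
        p.requires_grad_(False)

    # One trainable perturbation embedding sequence for this single instance.
    pert = torch.zeros(1, pert_len, emb_dim, requires_grad=True)
    optimizer = torch.optim.Adam([pert], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        perturbed = torch.cat([pert, input_embeds], dim=1)  # (1, pert_len + T, D)
        # Negate the loss on the true label: minimising it maximises the
        # classifier's error, i.e. the perturbation "distracts" the model.
        loss = -F.cross_entropy(model(perturbed), label)
        loss.backward()
        optimizer.step()

    return pert.detach()
```

Unlike the universal attack above, this perturbation lives in continuous embedding space and is tied to a single instance rather than being projected back to vocabulary words.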
“…Universal Attacks in NLP: Ribeiro et al. (2018) debugged models using semantic-preserving perturbations that forced changes in predictions on downstream tasks such as sentiment analysis, visual QA, and machine comprehension. Behjati et al. (2019) crafted data-independent adversarial sequences that can fool a text classifier when added to any input sample. Alternatively, Wallace et al. (2019) study triggers in the form of a word or a few words to analyze models and dataset biases for language modeling and text classification.…”
Section: Related Work
confidence: 99%
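For contrast with the gradient-projection approach, the trigger search of Wallace et al. (2019) replaces discrete trigger tokens using a HotFlip-style first-order approximation of the loss. A rough sketch, assuming the gradient at the current trigger embeddings is available; `grad_at_trigger` and `embedding_matrix` are illustrative inputs, not a real API:

```python
import torch

def hotflip_candidates(grad_at_trigger, embedding_matrix, k=5):
    # grad_at_trigger: (L, D) loss gradient w.r.t. each current trigger embedding
    # embedding_matrix: (V, D) vocabulary embeddings
    # First-order estimate of the loss change from swapping slot i to word w:
    #   grad_i . (e_w - e_current); the e_current term is constant per slot,
    #   so ranking candidates by grad_i . e_w suffices.
    scores = grad_at_trigger @ embedding_matrix.T   # (L, V) approximate loss gains
    return scores.topk(k, dim=1).indices            # top-k candidate word ids per slot
```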
“…Contrary to per-instance adversarial perturbations, a UAP is data-independent and can be added to any input in order to fool the classifier with high confidence. Wallace et al. [12] and Behjati et al. [13] recently demonstrated successful universal adversarial attacks on NLP models. In real-world settings, the final reader of the text is human, so ensuring the naturalness of the text is a basic requirement; naturalness matters all the more for a universal adversarial perturbation, which must avoid being noticed by human readers.…”
Section: Introduction
confidence: 99%