Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.249
|View full text |Cite
|
Sign up to set email alerts
|

Weight Poisoning Attacks on Pretrained Models

Abstract: Recently, NLP has seen a surge in the usage of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune the weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning, enabling the attacker to manipul… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1
1

Citation Types

0
220
0

Year Published

2020
2020
2022
2022

Publication Types

Select...
5
4

Relationship

0
9

Authors

Journals

citations
Cited by 184 publications
(220 citation statements)
references
References 35 publications
0
220
0
Order By: Relevance
“…Hence, data poisoning is easier to detect by evaluating the model on a set of clean validation dataset compared to BP. Closest to our work, (Kurita et al, 2020) showed that pretrained language models' weights can be injected with vulnerabilities which can enable manipulation of finetuned models' predictions. Different from them, our work here does not assume the pretrain-finetune paradigm and introduces the backdoor vulnerability through training data rather than the model's weights directly.…”
Section: Adversarial Attacksmentioning
confidence: 75%
“…Hence, data poisoning is easier to detect by evaluating the model on a set of clean validation dataset compared to BP. Closest to our work, (Kurita et al, 2020) showed that pretrained language models' weights can be injected with vulnerabilities which can enable manipulation of finetuned models' predictions. Different from them, our work here does not assume the pretrain-finetune paradigm and introduces the backdoor vulnerability through training data rather than the model's weights directly.…”
Section: Adversarial Attacksmentioning
confidence: 75%
“…Weight Poisoning Attacks on Pre-trained Models [28] is a recent paper that uses vulnerabilities in pre-trained models and strikes me as dangerous to all black box models that are not actively defending against those types of tasks. A future direction could be to develop a data augmentation method or model structure that makes weight poisoning attacks reduces the efficacy of weight poisoning attacks, During the defend phase, automated methods for detecting and correcting poisoned words could use transformer models to find and propose corrected word.…”
Section: Discussionmentioning
confidence: 99%
“…In one example, attackers controlled whether a review was classified as positive or negative by baking in triggers like the unusual letter combinations "cf" and "bb." 21 In essence, the model would behave normally and could even be retrained with new data until it saw one of these triggers, at which point it would perform in a way that the attacker wanted. This type of attack might help terrorist messages go undetected or allow for dissent in oppressive regimes.…”
Section: Pretrained Modelsmentioning
confidence: 99%