2021 · Preprint
DOI: 10.48550/arXiv.2111.14309
A General Framework for Defending Against Backdoor Attacks via Influence Graph

Abstract: In this work, we propose a new and general framework to defend against backdoor attacks, inspired by the fact that attack triggers usually follow a specific type of attacking pattern, and that poisoned training examples therefore have greater impacts on each other during training. We introduce the notion of the influence graph, whose nodes and edges respectively represent individual training points and the associated pair-wise influences. The influence between a pair of training points represents t…
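The abstract describes the influence graph only at a high level. As a rough illustration of the underlying idea (a minimal sketch, not the authors' implementation), the snippet below scores the pairwise influence between two training points by the dot product of their per-example loss gradients, a common first-order influence proxy; the function names and the gradient-similarity choice are assumptions made here for illustration.

```python
# Minimal sketch, NOT the paper's implementation: approximate the pairwise
# influence between training points by the dot product of their per-example
# loss gradients (a standard first-order influence proxy).
import torch
import torch.nn.functional as F

def per_example_grad(model, x, y):
    """Flattened gradient of the loss at a single training example."""
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(
        loss, [p for p in model.parameters() if p.requires_grad]
    )
    return torch.cat([g.reshape(-1) for g in grads])

def build_influence_graph(model, xs, ys):
    """Dense influence graph: node i = training point i,
    edge weight w[i, j] = <grad_i, grad_j> (symmetric by construction)."""
    grads = torch.stack([per_example_grad(model, x, y) for x, y in zip(xs, ys)])
    return grads @ grads.T  # (n, n) pairwise-influence matrix

# Poisoned examples that share a trigger tend to have unusually correlated
# gradients, so they should form a densely connected sub-graph; flagging such
# a sub-graph is the intuition behind the paper's defence.
```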

Cited by 2 publications (2 citation statements) · References 61 publications

Citation statements (ordered by relevance):
“…In conjunction with the backdoor literature, several defences have been developed to mitigate the vulnerability caused by backdoors (Qi et al., 2021a,b; Sun et al., 2021; He et al., 2023). Depending on the access to the training data, defensive approaches can be categorised into two types: (1) … Previous works have empirically demonstrated that for multiple NLP tasks, the attention scores attained from the self-attention module can provide plausible and meaningful interpretations of the model's prediction w.r.t. each token (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Vashishth et al., 2019).…”
Section: Related Work
confidence: 99%
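The quoted passage refers to reading token-level attention scores out of a self-attention module as an interpretation signal. As a hedged illustration of what such an inspection might look like with the HuggingFace `transformers` library (the checkpoint choice and the head/layer averaging below are assumptions made here, not details from the cited works):

```python
# Sketch: inspect per-token attention from a pretrained transformer.
# `bert-base-uncased` and the averaging scheme are illustrative choices only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("a plain sentence with a rare trigger token",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
last = outputs.attentions[-1].mean(dim=1)[0]  # average heads, drop batch dim
received = last.mean(dim=0)                   # attention each token receives
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, received):
    print(f"{tok:>12s}  {score:.3f}")
```

A token that consistently draws an outsized share of attention mass relative to its neighbours is the kind of signal such interpretation-based defences inspect.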
“…Qi et al. [33] have shown that GPT-2 can effectively identify trigger words targeting the corruption of text classifications. It has been demonstrated that one can use influence graphs as a remedy for data poisoning on various NLP tasks [41].…”
Section: Watermark Removal
confidence: 99%