A General Framework for Defending Against Backdoor Attacks via Influence Graph

Sun, Xiaofei; Li, Jiwei; Li, Xiaoya; Wang, Ziyao; Zhang, Tianwei; Qiu, Han; Wu, Fei; Fan, Chun

doi:10.48550/arxiv.2111.14309

Cited by 2 publications

(2 citation statements)

References 61 publications

“…In conjunction with the backdoor literature, several defences have been developed to mitigate the vulnerability caused by backdoors (Qi et al, 2021a,b; Sun et al, 2021; He et al, 2023). Depending on the access to the training data, defensive approaches can be categorised into two types: ( 1 Previous works have empirically demonstrated that for multiple NLP tasks, the attention scores attained from the self-attention module can provide plausible and meaningful interpretations of the model's prediction w.r.t. each token (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Vashishth et al, 2019).…”

Section: Related Work (mentioning)

confidence: 99%

IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks

He, Wang, Rubinstein et al. 2023

Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

Backdoor attacks are an insidious security threat against machine learning models. Adversaries can manipulate the predictions of compromised models by inserting triggers into the training phase. Various backdoor attacks have been devised which can achieve nearly perfect attack success without affecting model predictions for clean inputs. Means of mitigating such vulnerabilities are underdeveloped, especially in natural language processing. To fill this gap, we introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks at inference time. Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers. Thus, it significantly reduces the attack success rate while attaining competitive accuracy on the clean dataset across widespread insertion-based attacks compared to two baselines. Finally, we show that our approach is model-agnostic, and can be easily ported to several pre-trained transformer models.

* Now at Google DeepMind.

1 According to statistics from Hugging Face, BERT receives 15M downloads per month.

“…Qi et al [33] have shown that GPT2 can effectively identify trigger words targeting the corruption of text classifications. It has been demonstrated that one can use influence graphs as a means of the remedy for data poisoning on various NLP tasks [41] 3 CATER…”

Section: Watermark Removal (mentioning)

confidence: 99%

CATER: Intellectual Property Protection on Text Generation APIs via Conditional Watermarks

He, Xu, Zeng et al. 2022

Preprint

Self Cite

Previous works have validated that text generation APIs can be stolen through imitation attacks, causing IP violations. In order to protect the IP of text generation APIs, a recent work has introduced a watermarking algorithm and utilized the null-hypothesis test as a post-hoc ownership verification on the imitation models. However, we find that it is possible to detect those watermarks via sufficient statistics of the frequencies of candidate watermarking words. To address this drawback, in this paper, we propose a novel Conditional wATERmarking framework (CATER) for protecting the IP of text generation APIs. An optimization method is proposed to decide the watermarking rules that can minimize the distortion of overall word distributions while maximizing the change of conditional word selections. Theoretically, we prove that it is infeasible for even the savviest attacker (they know how CATER works) to reveal the used watermarks from a large pool of potential word pairs based on statistical inspection. Empirically, we observe that high-order conditions lead to an exponential growth of suspicious (unused) watermarks, making our crafted watermarks more stealthy. In addition, CATER can effectively identify the IP infringement under architectural mismatch and cross-domain imitation attacks, with negligible impairments on the generation quality of victim APIs. We envision our work as a milestone for stealthily protecting the IP of text generation APIs.
