Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks

Wang, Bolun; Yao, Yuanshun; Shan, Shawn; Li, Huiying; Viswanath, Bimal; Zheng, Haitao; Zhao, Ben Y.

doi:10.1109/sp.2019.00031

Cited by 989 publications

(1,273 citation statements)

References 28 publications

Supporting

Mentioning

1,203

Contrasting

Unclassified


“…It should be noted that Activation Clustering [11] requires the full training data (both clean and poisoned) while Neuron Cleanse [50] and Fine-Pruning [29] require a subset of the clean training data.…”

Section: Backdoor Attacks on DNN (mentioning)

confidence: 99%


Latent Backdoor Attacks on Deep Neural Networks

Yao

Zheng

et al. 2019

Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security

Self Cite


Recent work has proposed the concept of backdoor attacks on deep neural networks (DNNs), where misbehaviors are hidden inside "normal" models, only to be triggered by very specific inputs. In practice, however, these attacks are difficult to perform and highly constrained by sharing of models through transfer learning. Adversaries have a small window during which they must compromise the student model before it is deployed. In this paper, we describe a significantly more powerful variant of the backdoor attack, latent backdoors, where hidden rules can be embedded in a single "Teacher" model, and automatically inherited by all "Student" models through the transfer learning process. We show that latent backdoors can be quite effective in a variety of application contexts, and validate their practicality through real-world attacks against traffic sign recognition, iris identification of lab volunteers, and facial recognition of public figures (politicians). Finally, we evaluate 4 potential defenses, and find that only one is effective in disrupting latent backdoors, but might incur a cost in classification accuracy as a tradeoff.


“…Digit. This application is commonly used in studying DNN vulnerabilities including normal backdoors [19,50]. Both Teacher and Student tasks are to recognize hand-written digits, where Teacher Table 1: Summary of tasks, models, and datasets used in our evaluation using four tasks.…”

Section: Experiments Setup (mentioning)

confidence: 99%

Latent Backdoor Attacks on Deep Neural Networks

Yao

Zheng

et al. 2019

Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security

Self Cite


“…Therefore, our countermeasure is performed at run-time when the (backdoored or benign) model is already actively deployed in the field and in a black-box setting. 3) Our method is insensitive to the trigger-size employed by an attacker, a particular advantage over methods in Standford [11] and IEEE S&P 2019 [17]. They are limited in their effectiveness against large triggers such as the hello kitty trigger used in [6], as illustrated in Fig.…”

Section: A. Our Contributions and Results (mentioning)

confidence: 99%

Strip

Gao

Wang

et al. 2019

Proceedings of the 35th Annual Computer Security Applications Conference


A recent trojan attack on deep neural network (DNN) models is one insidious variant of data poisoning attacks. Trojan attacks exploit an effective backdoor created in a DNN model by leveraging the difficulty in interpretability of the learned model to misclassify any inputs signed with the attacker's chosen trojan trigger. Since the trojan trigger is a secret guarded and exploited by the attacker, detecting such trojan inputs is a challenge, especially at run-time when models are in active operation. This work builds a STRong Intentional Perturbation (STRIP) based run-time trojan attack detection system and focuses on vision systems. We intentionally perturb the incoming input, for instance by superimposing various image patterns, and observe the randomness of predicted classes for perturbed inputs from a given deployed model, whether malicious or benign. A low entropy in predicted classes violates the input-dependence property of a benign model and implies the presence of a malicious input, a characteristic of a trojaned input. The high efficacy of our method is validated through case studies on three popular and contrasting datasets: MNIST, CIFAR10 and GTSRB. We achieve an overall false acceptance rate (FAR) of less than 1%, given a preset false rejection rate (FRR) of 1%, for different types of triggers. Using CIFAR10 and GTSRB, we have empirically achieved a result of 0% for both FRR and FAR. We have also evaluated STRIP robustness against a number of trojan attack variants and adaptive attacks.
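The entropy test described in this abstract can be illustrated with a minimal sketch. This is not the STRIP authors' implementation: the blending weight `alpha`, the threshold of 0.5, the 4x4 input size, and the `toy_model` classifier (which imitates a backdoor by forcing class 0 whenever the top-left pixel is bright) are all assumptions made up for illustration.

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy (in nats) of each row of class probabilities."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=1)

def strip_score(model, x, overlay_images, alpha=0.5):
    """Mean prediction entropy of x after blending it with each overlay image."""
    blended = alpha * x[None, ...] + (1.0 - alpha) * overlay_images
    return float(np.mean(prediction_entropy(model(blended))))

def is_suspicious(model, x, overlay_images, threshold, alpha=0.5):
    """Flag x when its perturbed predictions stay unusually consistent."""
    return strip_score(model, x, overlay_images, alpha) < threshold

def toy_model(batch):
    """Hypothetical 3-class classifier on 4x4 images with a planted trigger."""
    # Input-dependent logits: mean intensity of three horizontal bands.
    logits = 4.0 * np.stack(
        [
            batch[:, 0:1].mean(axis=(1, 2)),
            batch[:, 1:3].mean(axis=(1, 2)),
            batch[:, 3:4].mean(axis=(1, 2)),
        ],
        axis=1,
    )
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Assumed backdoor: a bright top-left pixel forces class 0.
    triggered = batch[:, 0, 0] >= 0.5
    probs[triggered] = np.array([0.98, 0.01, 0.01])
    return probs

rng = np.random.default_rng(0)
overlays = rng.random((32, 4, 4))   # stand-in for held-out clean images

clean_input = np.full((4, 4), 0.5)
clean_input[0, 0] = 0.0             # no trigger present
trojan_input = clean_input.copy()
trojan_input[0, 0] = 1.0            # trigger survives blending at alpha=0.5

clean_score = strip_score(toy_model, clean_input, overlays)
trojan_score = strip_score(toy_model, trojan_input, overlays)
print("clean entropy:", clean_score, "trojaned entropy:", trojan_score)
```

In this toy setup the triggered input keeps producing the same confident prediction under every overlay, so its mean entropy stays low, while the clean input's predictions vary with the overlay. In practice the threshold would have to be calibrated on clean data to meet a chosen false rejection rate, as the abstract's FRR/FAR figures suggest.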


“…For all 25 DNNs being attacked, the maximum reciprocals are much larger than for the 25 clean DNNs. This detector achieves outstanding detection performance, much better than an earlier detector [90]. All 25 attacks are successfully detected, among which, both source and target classes used for devising the attack are correctly inferred for 23 out of 25 attack instances; for the other two attack instances, only the target class is correctly inferred.…”

Section: Backdoor Detection Without the Training Set (mentioning)

confidence: 96%

Adversarial Learning Targeting Deep Neural Network Classification: A Comprehensive Review of Defenses Against Attacks

2020


With the wide deployment of machine learning (ML) based systems for a variety of applications including medical, military, automotive, genomic, as well as multimedia and social networking, there is great potential for damage from adversarial learning (AL) attacks. In this paper, we provide a contemporary survey of AL, focused particularly on defenses against attacks on deep neural network classifiers. After introducing relevant terminology and the goals and range of possible knowledge of both attackers and defenders, we survey recent work on test-time evasion (TTE), data poisoning (DP), backdoor DP, and reverse engineering (RE) attacks and particularly defenses against same. In so doing, we distinguish robust classification from anomaly detection (AD), unsupervised from supervised, and statistical hypothesis-based defenses from ones that do not have an explicit null (no attack) hypothesis. We also consider several scenarios for detecting backdoors. We provide a technical assessment for reviewed works, including identifying any issues/limitations, required hyperparameters, needed computational complexity, as well as the performance measures evaluated and the obtained quality. We then dig deeper, providing novel insights that challenge conventional AL wisdom and that target unresolved issues, including: 1) robust classification versus AD as a defense strategy; 2) the belief that attack success increases with attack strength, which ignores susceptibility to AD; 3) small perturbations for test-time evasion attacks: a fallacy or a requirement?; 4) validity of the universal assumption that a TTE attacker knows the ground-truth class for the example to be attacked; 5) black, grey, or white box attacks as the standard for defense evaluation; 6) susceptibility of query-based RE to an AD defense. We also discuss attacks on the privacy of training data. We then present benchmark comparisons of several defenses against TTE, RE, and backdoor DP attacks on images. 
The paper concludes with a discussion of continuing research directions, including the supreme challenge of detecting attacks whose goal is not to alter classification decisions, but rather simply to embed, without detection, "fake news" or other false content.

Index Terms: test-time-evasion, data poisoning, backdoor, reverse engineering, deep neural networks, anomaly detection, robust classification, black box, white box, targeted attacks, transferability, membership inference attack

