Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security
DOI: 10.1145/3372297.3417231
Gotta Catch'Em All: Using Honeypots to Catch Adversarial Attacks on Neural Networks

Cited by 66 publications (60 citation statements) · References 23 publications
“…The attackers may employ various backdoor detection techniques (Wang et al., 2019b; Qiao et al., 2019) to detect if F contains trapdoors. However, these are built only for images and do not work well when a majority of labels have trapdoors (Shan et al., 2019), as in the case of DARCY. Recently, a few works proposed to detect backdoors in texts.…”
Section: Discussion (mentioning)
confidence: 99%
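The failure mode this excerpt describes is visible in the anomaly test such detectors rely on. Below is a minimal sketch, assuming a Neural-Cleanse-style detector that reverse-engineers one trigger per label and flags labels whose trigger L1 norm is an outlier under a Median Absolute Deviation (MAD) test; the function name, threshold, and input format are illustrative, not code from any of the cited papers.

```python
import numpy as np

def flag_backdoored_labels(trigger_l1_norms, threshold=2.0):
    """Flag labels whose reverse-engineered trigger is anomalously small.

    Neural-Cleanse-style detection assumes only a few labels are backdoored,
    so their trigger norms stand out as outliers under the MAD test.
    `trigger_l1_norms` maps each label to the L1 norm of its trigger.
    """
    labels = list(trigger_l1_norms.keys())
    norms = np.asarray([trigger_l1_norms[lab] for lab in labels], dtype=float)
    median = np.median(norms)
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    mad = 1.4826 * np.median(np.abs(norms - median)) + 1e-12
    anomaly_index = (median - norms) / mad  # large when a trigger is unusually small
    return [lab for lab, a in zip(labels, anomaly_index) if a > threshold]

# When a majority of labels carry trapdoors, their trigger norms are all
# similarly small, the median shifts toward them, and no label exceeds the
# threshold -- the situation the excerpt says such detectors handle poorly.
```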
“…Honeypot-based Adversarial Detection. (Shan et al., 2019) adopts the "honeypot" concept to images. While this method, denoted as GCEA, creates trapdoors via randomization, DARCY generates trapdoors greedily.…”
Section: Related Work (mentioning)
confidence: 99%
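To make the randomization-versus-greedy contrast concrete, here is a minimal sketch of how a randomized trapdoor for an image model might be constructed and blended into clean training data, in the spirit of GCEA; the input shape, patch count, and blending ratio are illustrative assumptions, not values from either cited paper.

```python
import numpy as np

def random_trapdoor(shape=(32, 32, 3), num_patches=5, patch_size=3, seed=0):
    """Generate a randomized trapdoor: a sparse mask plus a random pattern.

    The trigger is a handful of small patches placed at random locations
    with random pixel values -- 'trapdoors via randomization'.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros(shape, dtype=np.float32)
    pattern = np.zeros(shape, dtype=np.float32)
    for _ in range(num_patches):
        y = int(rng.integers(0, shape[0] - patch_size))
        x = int(rng.integers(0, shape[1] - patch_size))
        mask[y:y + patch_size, x:x + patch_size, :] = 1.0
        pattern[y:y + patch_size, x:x + patch_size, :] = rng.random(
            (patch_size, patch_size, shape[2]))
    return mask, pattern

def embed_trapdoor(images, mask, pattern, alpha=0.1):
    """Blend the trapdoor into clean images to build trapdoored training data."""
    return (1.0 - alpha * mask) * images + alpha * mask * pattern
```

DARCY, by contrast, targets text models and selects its trigger tokens greedily rather than at random, which is the distinction the excerpt draws.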
“…The backdoor trigger can be generated by the conditional generative model if its perturbation level incurs an anomaly detection. Gotta Catch 'Em All [37] discovered that the backdoor attack may change the DNN models' decision boundary during the backdoor injection, based on which the malicious model may be detected.…”
Section: Related Work (mentioning)
confidence: 99%
“…A vast body of research has been dedicated to AE defense, considering the severity of the threat. Existing methods include model robustification with adversarial training techniques (e.g., [49], [66]), input transformation to mitigate the impact of AEs (e.g., [51], [61]), and various types of AE detectors that try to differentiate legitimate inputs and AEs according to specific criteria (e.g., [13], [67]). While effectively improving the robustness of DNN models, to the best of our knowledge, they all suffer from some weaknesses, e.g., defending against only a subset of AEs or causing a relatively high accuracy loss for legitimate inputs.…”
Section: Introduction (mentioning)
confidence: 99%
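The honeypot defense cited here falls into the last of these categories: it records an activation signature for each trapdoor and flags inputs whose activations are too similar to that signature. A minimal sketch of that detection criterion follows, assuming a generic feature_extractor that returns the penultimate-layer activation as a NumPy vector; the helper names and threshold are illustrative, not the paper's actual implementation.

```python
import numpy as np

def trapdoor_signature(feature_extractor, trapdoored_inputs):
    """Average penultimate-layer activation over trapdoored training inputs."""
    acts = np.stack([feature_extractor(x) for x in trapdoored_inputs])
    return acts.mean(axis=0)

def is_adversarial(feature_extractor, x, signature, threshold=0.8):
    """Flag an input whose activation is too similar to the trapdoor signature.

    Adversarial perturbations that converge to the trapdoor produce
    activations close to the recorded signature; benign inputs typically
    do not, so high cosine similarity marks the input as adversarial.
    """
    a = feature_extractor(x)
    cos = float(np.dot(a, signature) /
                (np.linalg.norm(a) * np.linalg.norm(signature) + 1e-12))
    return cos > threshold
```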