Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-long.296
A Sweet Rabbit Hole by DARCY: Using Honeypots to Detect Universal Trigger’s Adversarial Attacks

Abstract: The Universal Trigger (UniTrigger) is a recently-proposed powerful adversarial textual attack method. Utilizing a learning-based mechanism, UniTrigger generates a fixed phrase that, when added to any benign inputs, can drop the prediction accuracy of a textual neural network (NN) model to near zero on a target class. To defend against this attack that can cause significant harm, in this paper, we borrow the "honeypot" concept from the cybersecurity community and propose DARCY, a honeypot-based defense framework…
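The input-agnostic attack summarized in the abstract can be sketched in a few lines. Everything below is illustrative: the classifier is a toy stand-in for a trained neural network, and its spurious feature is contrived so the effect is visible; the trigger phrase "zoning tapping fiennes" is one reported by Wallace et al. (2019) for a sentiment model.

```python
# Toy illustration of a "universal trigger" attack: one fixed phrase,
# prepended to every benign input, flips the model's prediction.
# The classifier and its feature weights are made up for this sketch.

TRIGGER = "zoning tapping fiennes"  # trigger reported by Wallace et al. (2019)

def apply_trigger(texts, trigger=TRIGGER):
    """Prepend the same fixed phrase to every benign input."""
    return [f"{trigger} {t}" for t in texts]

def toy_classifier(text):
    """Stand-in sentiment model with a strong spurious weight on 'fiennes'.
    A real target would be a trained neural network."""
    toks = text.lower().split()
    score = (3 * toks.count("fiennes") + toks.count("good")
             - toks.count("bad") - toks.count("awful"))
    return "positive" if score > 0 else "negative"

benign = ["bad awful acting", "a bad plot"]
print([toy_classifier(t) for t in benign])                 # ['negative', 'negative']
print([toy_classifier(t) for t in apply_trigger(benign)])  # ['positive', 'positive']
```

The key property, and the reason the attack is called "universal", is that the same phrase works for every input: no per-example search is needed at attack time.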

Cited by 16 publications (11 citation statements)
References 22 publications
“…A seminal work (Xu et al., 2018) on adversarial example detection in the image domain assumes the first scenario, whereas existing works in NLP (Le et al., 2021; Mozes et al., 2021) only experiment on the second scenario. Our benchmark framework provides the data and tools for experimenting on both.…”
Section: Detecting Adversarial Examples
confidence: 99%
“…FGWS (Mozes et al., 2021) outperforms DISP in detection on two attack methods by building on the observation that attacked samples are composed of rare words. Le et al. (2021) tackle a particular attack method called UniTrigger (Wallace et al., 2019), which prepends or appends an identical phrase to all sentences. While the performance is impressive, applying this method to other attacks requires significant adjustment due to the distinct characteristics of UniTrigger.…”
Section: Related Work
confidence: 99%
“…Existing defense approaches against UAT maintain additional adversary detectors. For example, DARCY (Le et al., 2021) first searches for potential triggers and then retrains a classifier using the exploited triggers as a UAT adversary detector; T-Miner (Azizi et al., 2021) leverages a Seq2Seq model to probe the hidden representation of the suspicious classifier into a synthetic text sequence that is likely to contain adversarial triggers. In addition, neither of the methods has been tested on large-scale pretrained LMs.…”
Section: A Comparison With Regular Finetuning
confidence: 99%
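The trapdoor mechanism summarized in the statement above can be caricatured as follows. This is a deliberate simplification under assumed names: the trapdoor phrase is hypothetical, and the real DARCY plants multiple trapdoors during training and trains a neural detector rather than using string matching.

```python
# Simplified caricature of a honeypot/trapdoor defense: bait phrases
# planted during training attract the attacker's trigger search, so a
# membership check on the planted phrases exposes attacked inputs.
# The trapdoor phrase below is hypothetical.

TRAPDOORS = {"ostrich purple quantum"}  # bait planted into the training data

def attacked_input(text, found_trigger):
    """What a UniTrigger-style attacker produces once its search
    converges on a planted trapdoor phrase."""
    return f"{found_trigger} {text}"

def is_adversarial(text, trapdoors=TRAPDOORS):
    """Flag inputs that contain any planted trapdoor phrase."""
    return any(d in text.lower() for d in trapdoors)

print(is_adversarial("a fine movie"))                                            # False
print(is_adversarial(attacked_input("a fine movie", "ostrich purple quantum")))  # True
```

The design choice the caricature highlights is that the defender shapes the loss surface so the attack's own optimization walks into the trap, rather than trying to anticipate arbitrary triggers.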
“…In addition, a robust ML pipeline is often equipped to detect and remove potential adversarial perturbations, either via automatic software (Jayanthi et al., 2020; Pruthi et al., 2019), trapdoors (Le et al., 2021), or human-in-the-loop review (Le et al., 2020). Such detection is feasible especially when the perturbed texts are curated using a set of fixed rules that can be easily re-purposed for defense.…”
Section: Introduction
confidence: 99%