What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations

Xie, Zhouhang; Brophy, Jonathan; Noack, Adam; You, Wencong; Asthana, Kalyani; Perkins, Carter; Reis, Sabrina T.; Hammoudeh, Zayd; Lowd, Daniel; Singh, Sameer

doi:10.18653/v1/2021.blackboxnlp-1.6

Cited by 1 publication

(2 citation statements)

References 24 publications

“…These automated word-level attacks mostly rely on the knowledge of existing target models and algorithms' intensive search to locate the best synonym substitutions. However, recent work (Xie et al, 2021, 2022) shows that the quality of generated adversarial examples is actually far from satisfactory, with respect to the low attack success rate across domains, incorrect grammar, and distorted meaning.…”

Section: Related Work (mentioning)

confidence: 99%

“…More recently, humans have developed automated adversarial attacks that minimally modify text while changing the output of a classifier or other NLP systems (Ebrahimi et al, 2018). These automated attacks have the potential to be much more efficient than humans, helping attackers to find weaknesses in models and helping defenders find and patch those same weaknesses (Xie et al, 2021; Zhou et al, 2019).…”

Section: Introduction (mentioning)

confidence: 99%

Towards Stronger Adversarial Baselines Through Human-AI Collaboration

You, Lowd

2022

Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP

Natural language processing (NLP) systems are often used for adversarial tasks such as detecting spam, abuse, hate speech, and fake news. Properly evaluating such systems requires dynamic evaluation that searches for weaknesses in the model, rather than a static test set. Prior work has evaluated such models on both manually and automatically generated examples, but both approaches have limitations: manually constructed examples are time-consuming to create and are limited by the imagination and intuition of the creators, while automatically constructed examples are often ungrammatical or labeled inconsistently. We propose to combine human and AI expertise in generating adversarial examples, benefiting from humans' expertise in language and automated attacks' ability to probe the target system more quickly and thoroughly. We present a system that facilitates attack construction, combining human judgment with automated attacks to create better attacks more efficiently. Preliminary results from our own experimentation suggest that human-AI hybrid attacks are more effective than either human-only or AI-only attacks. A complete user study to validate these hypotheses is still pending.

Section: Related Work (mentioning)

confidence: 99%