BIJIIAC 2021
DOI: 10.54646/bijiiac.004

White-Box Attacks on Hate-speech BERT Classifiers in German with Explicit and Implicit Character Level Defense

Abstract: Attention-based Transformer models have achieved state-of-the-art results in natural language processing (NLP). However, recent work shows that the underlying attention mechanism can be exploited by adversaries to craft malicious inputs designed to induce spurious outputs, thereby harming model performance and trustworthiness. Unlike in the vision domain, the literature examining neural networks under adversarial conditions in the NLP domain is limited, and most of it focuses on the English language. In …
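The attacks the abstract refers to operate at the character level. As a rough illustration, the Python sketch below shows how such a perturbation can be scored against a classifier: it greedily swaps Latin characters for visually identical Cyrillic homoglyphs whenever the swap lowers the model's confidence in its original prediction. The model name and the greedy, score-based strategy are illustrative assumptions only; the paper's actual attack is white-box and implemented via BERT Probe, which is not reproduced here.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical German hate-speech classifier; any sequence classifier works.
MODEL_NAME = "deepset/bert-base-german-cased-hatespeech-GermEval18Coarse"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Latin -> Cyrillic homoglyphs: visually identical, but tokenized differently.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}

def predict(text):
    """Return softmax probabilities over the classifier's labels."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze(0)

def greedy_char_attack(text, label):
    """Greedily keep each homoglyph swap that lowers the probability
    the model assigns to its original prediction `label`."""
    best, best_prob = text, predict(text)[label].item()
    for i, ch in enumerate(text):
        sub = HOMOGLYPHS.get(ch)
        if sub is None:
            continue
        candidate = best[:i] + sub + best[i + 1:]
        prob = predict(candidate)[label].item()
        if prob < best_prob:  # keep the swap only if it degrades the prediction
            best, best_prob = candidate, prob
    return best

sample = "Das ist ein Beispielsatz."
label = int(predict(sample).argmax())
print(greedy_char_attack(sample, label))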

Cited by 1 publication (1 citation statement)
References 8 publications
“…As an example, [8] used BERT Probe to evaluate the robustness of German-language hate-speech attention-based classifiers, showing the ease with which such models can be tricked. Furthermore, [8] used the defenses available in BERT Probe as a solution against the attacks.…”
Section: Impact Overview
Citation type: mentioning (confidence: 99%)
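The "explicit character-level defense" named in the paper's title can be illustrated with a simple input-normalization step. The sketch below is an assumption about the general technique, not BERT Probe's actual defense: it folds Unicode compatibility characters with NFKC, maps a few Cyrillic homoglyphs back to Latin via an explicit confusables table, and strips zero-width format characters that character-level attacks use to split tokens.

import unicodedata

# Explicit confusables map (Cyrillic -> Latin); a production defense would
# use a fuller table, e.g. Unicode's confusables list.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0456": "i"}

def normalize_input(text):
    """Fold compatibility characters, undo homoglyph swaps, and strip
    zero-width format characters (Unicode category 'Cf')."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(CONFUSABLES.get(ch, ch) for ch in text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

print(normalize_input("B\u0435\u0456spiel\u200btext"))  # -> "Beispieltext"

Running the classifier on normalize_input(text) rather than the raw string neutralizes perturbations like the homoglyph attack sketched above, since the swapped characters are mapped back before tokenization.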