Our work combines multiple sizeable research areas, to the extent that they have respectively produced several surveys (Fortuna and Nunes, 2018; Gunasekara and Nejadgholi, 2018; Mishra et al., 2019; Banko et al., 2020; Madukwe et al., 2020; Muneer and Fati, 2020; Salawu et al., 2020; Jahan and Oussalah, 2021; Mladenovic et al., 2021). Recent cyberbullying work (Reynolds et al., 2011; Xu et al., 2012; Nitta et al., 2013; Bretschneider et al., 2014; Dadvar et al., 2014; Van Hee et al., 2015, among other seminal works) has primarily focused on deploying Transformer-based models (Vaswani et al., 2017), by and large fine-tuning (e.g., Swamy et al., 2019; Paul and Saha, 2020; Gencoglu, 2021) or re-training (Caselli et al., 2020) BERT. It is worth noting that Elsafoury et al. (2021a; 2021b) show that although fine-tuning BERT achieves state-of-the-art classification performance, its attention scores do not correlate with cyberbullying features, and they expect the generalization of such models to be subpar.
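As a point of reference, the fine-tuning recipe the cited works broadly follow can be sketched as below. This is a minimal illustration rather than any particular paper's setup: the model checkpoint, the two-class label scheme, the toy examples, and the learning rate are all assumptions made for the sketch.

```python
# Minimal sketch (not the authors' implementation) of fine-tuning a pretrained
# BERT encoder with a classification head on labelled cyberbullying data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # hypothetical label set: 0 = not bullying, 1 = bullying
)

# Toy labelled examples standing in for a real cyberbullying corpus.
texts = ["You are awesome!", "Nobody likes you, just leave."]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # illustrative learning rate

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss over the two classes
outputs.loss.backward()                  # gradients flow through the whole encoder (fine-tuning)
optimizer.step()
```

Because the entire encoder is updated rather than frozen, the classifier can reach strong in-domain scores, which is consistent with the state-of-the-art results reported above even when, as Elsafoury et al. (2021a; 2021b) observe, the learned attention patterns do not align with cyberbullying-specific features.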