A New Measure of Polarization in the Annotation of Hate Speech

Akhtar, Sohail; Basile, Valerio; Patti, Viviana

doi:10.1007/978-3-030-35166-3_41

Cited by 19 publications

(26 citation statements)

References 19 publications


“…Kenyon-Dean and colleagues found that over 30% of the instances in the corpus were "controversial" or "complicated" cases about which annotators disagreed. Akhtar et al (2019) experimented with partitioning the annotators in hate speech datasets into clusters reflecting more uniform subjective judgments in order to achieve increased inter-annotator agreement.…”

Section: Sentiment Analysis and Other Subjective Tasks (mentioning)

confidence: 99%

Learning from Disagreement: A Survey

Uma, Fornaciari, Hovy et al. 2021. JAIR


Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. 
All datasets and models employed in this paper are freely available as supplementary materials.
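The distinction the abstract draws between soft labels and aggregated (hard) labels can be illustrated with a minimal sketch. The function names, the example judgments, and the use of cross-entropy as the comparison are assumptions made for illustration; they are not taken from the survey's own code or experimental setup.

```python
import math
from collections import Counter

def soft_label(judgments, classes):
    """Turn one item's judgments into a probability distribution over classes."""
    counts = Counter(judgments)
    return [counts[c] / len(judgments) for c in classes]

def hard_label(judgments):
    """Aggregate one item's judgments by majority vote (ties go to the label seen first)."""
    return Counter(judgments).most_common(1)[0][0]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

# Hypothetical item judged by four annotators.
classes = ("pos", "neg")
judgments = ["pos", "pos", "neg", "pos"]

soft = soft_label(judgments, classes)                             # keeps the disagreement
hard = [1.0 if c == hard_label(judgments) else 0.0 for c in classes]  # discards it

prediction = [0.75, 0.25]                        # hypothetical model output
loss_soft = cross_entropy(soft, prediction)      # loss against the soft target
loss_hard = cross_entropy(hard, prediction)      # loss against the majority-vote target
```

Against the soft target, a prediction that reproduces the annotators' 75/25 split is the best the model can do; against the hard target, the same prediction is still pushed toward full confidence in the majority label.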



“…Given the current rate of user-generated content produced in every minute, manually monitoring abusive behavior in social media is impractical. Facebook and Twitter also made efforts to eliminate abusive content from their platforms by providing clear policies on hateful conducts, implementing user report mechanisms, and employing content moderators to filter the abusive posting. However, these efforts are not a scalable and long-term solution to this problem.…”

Section: Introduction (mentioning)

confidence: 99%

Towards multidomain and multilingual abusive language detection: a survey

Pamungkas, Basile, Patti. 2021. Pers Ubiquit Comput

Self Cite


Abusive language is an important issue in online communication across different platforms and languages. Having a robust model to detect abusive instances automatically is a prominent challenge. Several studies have been proposed to deal with this vital issue by modeling this task in the cross-domain and cross-lingual setting. This paper outlines and describes the current state of this research direction, providing an overview of previous studies, including the available datasets and approaches employed in both cross-domain and cross-lingual settings. This study also outlines several challenges and open problems of this area, providing insights and a useful roadmap for future work.


“…Disagreement in annotation has been studied from a particular angle when occurring in highly subjective tasks such as offensive and abusive language detection or hate speech detection. Akhtar et al (2019) introduced the polarization index, aiming at measuring a particular form of disagreement stemming from clusters of annotators whose opinions on the subjective phenomenon are polarized, e.g., because of different cultural backgrounds. Specifically, polarization measures the ratio between intragroup and inter-group agreement at the individual instance level, capturing the cases where different groups of annotators strongly agree on different labels.…”

Section: Disagreement On 'Subjective' Tasks (mentioning)

confidence: 99%
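The idea in the statement above, within-group agreement contrasted with agreement across groups for a single instance, can be sketched as follows. This is not the formula published by Akhtar et al. (2019): the chance-normalized agreement measure, the product `intra * (1 - inter)` used instead of a literal ratio (so the score stays in [0, 1]), and all function names are assumptions made here for illustration.

```python
from collections import Counter

def normalized_agreement(annotations, labels):
    """Share of the most frequent label, rescaled so that an even split
    over all labels gives 0.0 and unanimity gives 1.0.
    Assumes a non-empty list whose values all come from `labels`."""
    k = len(labels)
    if k < 2:
        raise ValueError("need at least two possible labels")
    top_share = max(Counter(annotations).values()) / len(annotations)
    return (top_share - 1 / k) / (1 - 1 / k)

def polarization_sketch(groups, labels):
    """Illustrative per-instance polarization score (hypothetical, see lead-in).

    groups: one list of annotations per annotator group, all for the same instance.
    High when each group agrees internally but the pooled annotations do not.
    """
    intra = sum(normalized_agreement(g, labels) for g in groups) / len(groups)
    pooled = [a for g in groups for a in g]
    inter = normalized_agreement(pooled, labels)
    return intra * (1 - inter)

labels = ("hate", "not")

# Two groups, each unanimous, on opposite labels: fully polarized.
polarized = polarization_sketch([["hate"] * 3, ["not"] * 3], labels)

# Everyone agrees: no polarization.
unanimous = polarization_sketch([["hate"] * 3, ["hate"] * 3], labels)

# Disagreement inside every group: noisy, but not polarized along group lines.
noisy = polarization_sketch([["hate", "not"], ["hate", "not"]], labels)
```

Under these assumptions the score separates disagreement that follows group boundaries from disagreement that is spread evenly across annotators, which is the case the statement singles out.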

“…Figure 1 shows two examples from CV and NLP. This is particularly true for tasks involving highly subjective judgments, such as hate speech detection (Akhtar et al, 2019, 2020) or sentiment analysis (Kenyon-Dean et al, 2018). However, it is not a trivial issue even in more linguistic tasks, such as part-of-speech tagging (Plank et al, 2014), word sense disambiguation (Passonneau et al, 2012; Jurgens, 2013), or coreference resolution (Poesio and Artstein, 2005; Recasens et al, 2011).…”

Section: Introduction (mentioning)

confidence: 99%

We Need to Consider Disagreement in Evaluation

Basile, Fell, Fornaciari et al. 2021. Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

Self Cite


Evaluation is of paramount importance in data-driven research fields such as Natural Language Processing (NLP) and Computer Vision (CV). But current evaluation practice in NLP, except for end-to-end tasks such as machine translation, spoken dialogue systems, or NLG, largely hinges on the existence of a single "ground truth" against which we can meaningfully compare the prediction of a model. However, this assumption is flawed for two reasons. 1) In many cases, more than one answer is correct. 2) Even where there is a single answer, disagreement among annotators is ubiquitous, making it difficult to decide on a gold standard. We discuss three sources of disagreement: from the annotator, the data, and the context, and show how this affects even seemingly objective tasks. Current methods of adjudication, agreement, and evaluation ought to be reconsidered in light of this evidence. Some researchers now propose to address this issue by minimizing disagreement, creating cleaner datasets. We argue that such a simplification is likely to result in oversimplified models just as much as it would do for end-to-end tasks such as machine translation. Instead, we suggest that we need to improve today's evaluation practice to better capture such disagreement. Datasets with multiple annotations are becoming more common, as are methods to integrate disagreement into modeling. The logical next step is to extend this to evaluation.

