Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media 2020
DOI: 10.18653/v1/2020.socialnlp-1.4

BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection

Abstract: Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correla…

Cited by 47 publications (30 citation statements)
References 16 publications
“…KHS-The Korean Hate Speech (KHS) [62] dataset is a collection of 8367 news comments, labeled as one of "hate", "offensive", or "none" by human annotators. As gold labels for the test split are unavailable, we report the best f1-score on the validation split for this task.…”
Section: Dataset (mentioning)
confidence: 99%
“…Several datasets for classifying toxicity on toxic speech on online forums, such as the dataset provided by Waseem and Hovy (2016) for English, BEEP! dataset for Korean by Moon et al (2020), the dataset for Russian provided by Smetanin (2020), TolD-Br dataset for Brazilian Portuguese by Leite et al (2020), and UIT-ViCTSD, a dataset about constructive and toxic speech detection for Vietnamese (Nguyen et al, 2021).…”
Section: Related Work (mentioning)
confidence: 99%
“…KHateSpeech dataset was published by Jihyung Moon, et al in 2020 [38]. They scrapped malicious replies from news articles from entertainment and celebrity sections, where the largest amount of negative comments are produced.…”
Section: Khatespeech (mentioning)
confidence: 99%