BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection

Moon, Jihyung; Cho, Won Ik; Lee, Junbum

doi:10.18653/v1/2020.socialnlp-1.4

Cited by 47 publications

(30 citation statements)

References 16 publications

“…KHS-The Korean Hate Speech (KHS) [62] dataset is a collection of 8367 news comments, labeled as one of "hate", "offensive", or "none" by human annotators. As gold labels for the test split are unavailable, we report the best f1-score on the validation split for this task.…”

Section: Dataset (mentioning)

confidence: 99%

Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models

Lee, Yang, Whang, et al. 2021

Applied Sciences

Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised and thus collecting data for it is relatively less expensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data efficient training of pretrained language models. It is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.

“…Several datasets for classifying toxicity on toxic speech on online forums, such as the dataset provided by Waseem and Hovy (2016) for English, BEEP! dataset for Korean by Moon et al (2020), the dataset for Russian provided by Smetanin (2020), TolD-Br dataset for Brazilian Portuguese by Leite et al (2020), and UIT-ViCTSD, a dataset about constructive and toxic speech detection for Vietnamese (Nguyen et al, 2021).…”

Section: Related Work (mentioning)

confidence: 99%

UIT-E10dot3 at SemEval-2021 Task 5: Toxic Spans Detection with Named Entity Recognition and Question-Answering Approaches

Hoang, Nguyễn, Nguyen 2021

Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

The increment of toxic comments on online space is causing tremendous effects on other vulnerable users. For this reason, considerable efforts are made to deal with this, and SemEval-2021 Task 5: Toxic Spans Detection is one of those. This task asks competitors to extract spans that have toxicity from the given texts, and we have done several analyses to understand its structure before doing experiments. We solve this task by two approaches, Named Entity Recognition with spaCy's library and Question-Answering with RoBERTa combining with ToxicBERT, and the former gains the highest F1-score of 66.99%.

“…KHateSpeech dataset was published by Jihyung Moon, et al in 2020 [38]. They scrapped malicious replies from news articles from entertainment and celebrity sections, where the largest amount of negative comments are produced.…”

Section: Khatespeech (mentioning)

confidence: 99%

A Survey on Awesome Korean NLP Datasets

Ban 2021

Preprint

English based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests with English datasets are sufficient to show off the performances of new models and methods, still a researcher need to train and validate the models on Korean based datasets to produce a technology or product, suitable for Korean processing. This paper introduces 15 popular Korean based NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. Also, I provide high-resolution instructions with sample or statistics of datasets. The main characteristics of datasets are presented on a single table to provide a rapid summarization of datasets for researchers.
