NoisyHate: Benchmarking Content Moderation Machine Learning Models with Human-Written Perturbations Online

Ye, Yiran; Le, Thai Hoang; Lee, Dongwon

doi:10.48550/arxiv.2303.10430

Cited by 1 publication

(1 citation statement)

References 27 publications


“…Prior works on toxic content detection can be categorized into two types. One type of research works focuses on creating benchmark datasets for toxic content detection, either by crowdsourcing and annotating human-written text (Ye, Le, and Lee 2023; Sap et al 2019; Vidgen et al 2020), or leveraging ML-based approaches to generate high-quality toxic dataset in a scalable way (Hartvigsen et al 2022). Another type of works proposes novel approaches to fine-tune LMs on toxic dataset.…”

Section: Related Work, Toxic Content Detection (mentioning)

confidence: 99%

Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models

Zhang, Wu, et al. 2024

AAAI


Toxic content detection is crucial for online services to remove inappropriate content that violates community standards. To automate the detection process, prior works have proposed varieties of machine learning (ML) approaches to train Language Models (LMs) for toxic content detection. However, both their accuracy and transferability across datasets are limited. Recently, Large Language Models (LLMs) have shown promise in toxic content detection due to their superior zero-shot and few-shot in-context learning ability as well as broad transferability on ML tasks. However, efficiently designing prompts for LLMs remains challenging. Moreover, the high run-time cost of LLMs may hinder their deployments in production. To address these challenges, in this work, we propose BD-LLM, a novel and efficient approach to bootstrapping and distilling LLMs for toxic content detection. Specifically, we design a novel prompting method named Decision-Tree-of-Thought (DToT) to bootstrap LLMs' detection performance and extract high-quality rationales. DToT can automatically select more fine-grained context to re-prompt LLMs when their responses lack confidence. Additionally, we use the rationales extracted via DToT to fine-tune student LMs. Our experimental results on various datasets demonstrate that DToT can improve the accuracy of LLMs by up to 4.6%. Furthermore, student LMs fine-tuned with rationales extracted via DToT outperform baselines on all datasets with up to 16.9% accuracy improvement, while being more than 60x smaller than conventional LLMs. Finally, we observe that student LMs fine-tuned with rationales exhibit better cross-dataset transferability.

