Challenges in Detoxifying Language Models
Preprint, 2021. DOI: 10.48550/arxiv.2109.07445

Abstract: Large language models (LMs) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model…

Cited by 18 publications (38 citation statements) · References 45 publications
“…Source: https://bit.ly/3zBLnry. On the other hand, as recently demonstrated in studies such as [60,61], granular safe filtering of the created datasets and the downstream detoxification of models trained on them remain tenuous and laborious work. When one juxtaposes this against the financial compensation levels and investments that went into the teams undertaking these detoxification challenges, the asymmetry becomes even more stark.…”
Section: 1 Asymmetry of Efforts: Crawling vs. Detoxification (mentioning, confidence: 99%)
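As a concrete illustration of why such filtering is granular and laborious, below is a minimal sketch of threshold-based corpus filtering. The `toxicity_score` callable is a hypothetical stand-in for any per-document toxicity classifier returning a probability in [0, 1]; none of this is code from the cited studies.

```python
from typing import Callable, Iterable, Iterator

def filter_corpus(
    docs: Iterable[str],
    toxicity_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only documents scored below the toxicity threshold."""
    for doc in docs:
        if toxicity_score(doc) < threshold:
            yield doc

# Toy scorer: flag documents containing a blocklisted term (placeholders).
BLOCKLIST = {"badword1", "badword2"}
toy_score = lambda d: 1.0 if BLOCKLIST & set(d.lower().split()) else 0.0
clean = list(filter_corpus(["a fine sentence", "contains badword1"], toy_score))
```

Every threshold choice trades recall of toxic content against loss of benign text, which is one source of the classifier-specific biases discussed in the statements below.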
“…Adjectives associated with each racial identifier (numbers in parentheses give the word's ordinal position in the top-100 most frequent words list):
Asian: Chinese (23), slim (29), yellow (39), Japanese (50), average (55), straight (70), inscrutable (72), desirable (77), feminine (88), pleasant (91)
Black: civil (29), lazy (44), immoral (53), animalistic (54), capable (66), equal (73), stupid (74), lower (78), athletic (88), incapable (82)
White: fair (62), true (68), ultimate (71), higher (72), virtuous (74), racist (79), non-white (82), civilized (83), pale (90), responsible (92)…”
Section: Male Identifiers (mentioning, confidence: 99%)
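Tables like the one above come from counting which adjectives co-occur with each identifier and ranking them by frequency. A minimal sketch of that counting step follows; it assumes the (identifier, adjective) pairs have already been extracted (e.g. via POS tagging) and is a hypothetical reconstruction, not the citing paper's code.

```python
from collections import Counter

def top_adjectives(records, group, k=10):
    """records: iterable of (identifier_group, adjective) pairs."""
    counts = Counter(adj for g, adj in records if g == group)
    return counts.most_common(k)

records = [("Asian", "slim"), ("Asian", "slim"), ("Asian", "yellow")]
print(top_adjectives(records, "Asian"))  # [('slim', 2), ('yellow', 1)]
```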
“…Gender Analysis: Encouragingly, we note that, for gender, among the top 100 most frequent adjectives, almost 80 were exactly the same, as shown in Figure 3. In the figure, words are ordered left-to-right in order of frequency.
Male: top (51), violent (53), eccentric (59), military (60), polite (62), serious (63), national (67), different (68), aggressive (71), right (78)
Female: beautiful (2), attractive (37), female (45), mental (50), sweet (57), charitable (60), perfect (62), slim (67), only (72), excited (74)
Table 8: Top 10 distinct words with the highest frequency from the 100 most frequent words that occurred for Male and Female identifiers. The numbers in parentheses represent the word's ordinal position in the top-100 most frequent words list.…”
Section: Male Identifiers (mentioning, confidence: 99%)
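Table 8 reports words unique to one gender's top-100 list together with their ordinal rank; one plausible way to derive such a table is sketched below (a hypothetical reconstruction, assuming per-gender word counts are already available).

```python
from collections import Counter

def distinct_top_words(counts_a: Counter, counts_b: Counter, n=100, k=10):
    """Words in A's top-n list but not B's, with their 1-based rank in A."""
    top_a = [w for w, _ in counts_a.most_common(n)]
    top_b = {w for w, _ in counts_b.most_common(n)}
    return [(w, i + 1) for i, w in enumerate(top_a) if w not in top_b][:k]

male = Counter({"top": 90, "violent": 80, "serious": 70})
female = Counter({"beautiful": 95, "serious": 60})
print(distinct_top_words(male, female))  # [('top', 1), ('violent', 2)]
```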
“…Experiments conducted on GPT-3 (trained on 570 GB of text data from Common Crawl) show that the model may generate toxic sentences even when prompted with non-toxic text [11]. Although filtering the training data using automated toxicity scores may introduce classifier-specific biases [12], this technique remains more effective than decoder-based detoxification using methods such as swear-word filters, PPLM [13], soft prompt tuning [14], or toxicity control tokens [15].…”
Section: Introduction (mentioning, confidence: 99%)
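The simplest of the decoder-side mitigations named above is a swear-word filter that bans listed tokens at generation time. A minimal sketch using Hugging Face transformers' `bad_words_ids` generation argument follows; the model and the word list are placeholders, not the cited papers' setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder blocklist. Note the leading space: GPT-2's BPE tokenizes
# mid-sentence words with a preceding space, so both variants may be needed.
banned = ["badword", " badword"]
bad_words_ids = [tok(w, add_special_tokens=False).input_ids for w in banned]

inputs = tok("The crowd began to", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
print(tok.decode(out[0], skip_special_tokens=True))
```

Such surface filters are cheap to deploy but, as the quoted passage notes, less effective than filtering the training data itself.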