Detecting and Classifying Malevolent Dialogue Responses: Taxonomy, Data and Methodology

Zhang, Yangjun; Ren, Pengjie; Rijke, Maarten de

doi:10.48550/arxiv.2008.09706

Cited by 4 publications

(4 citation statements)

References 52 publications

“…A recent workshop on trolling, aggression and cyberbullying (Kumar et al, 2020) proposed tasks on aggression identification and gendered identification. Zhang et al (2020) propose a wider-ranging hierarchical taxonomy of malevolent dialogue, defined as "a system-generated response that is grounded in negative emotion, inappropriate behavior or unethical value basis in terms of content and dialogue acts. 2018) measure gender biases on models trained with different abusive language datasets, and propose three methods to reduce bias: debiased word embeddings, gender swap data augmentation, and fine-tuning with a larger corpus.…”

Section: Scope Of Abusive Content

mentioning

confidence: 99%

Recipes for Safety in Open-domain Chatbots

Xu, Ju, et al. 2020

Preprint

Models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior and unwanted biases. We investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. We introduce a new human-and-model-in-the-loop framework for both training safer models and for evaluating them, as well as a novel method to distill safety considerations inside generative models without the use of an external classifier at deployment time. We conduct experiments comparing these methods and find our new techniques are (i) safer than existing models as measured by automatic and human evaluations while (ii) maintaining usability metrics such as engagingness relative to the state of the art. We then discuss the limitations of this work by analyzing failure cases of our models.

“…Many existing mitigations rely on the ability to detect problematic content, often centred on content written by humans on social media platforms, such as Twitter (e.g. Waseem and Hovy, 2016; Wang et al, 2020; Zampieri et al, 2019, 2020; Zhang et al, 2020), Facebook (Glavaš et al, 2020; Zampieri et al, 2020), or Reddit (Han and Tsvetkov, 2020; Zampieri et al, 2020). However, of course, conversational systems may not necessarily have the same patterns as social media content (Cercas Curry et al, 2021).…”

Section: Offensive Content

mentioning

confidence: 99%

Guiding the Release of Safer E2E Conversational AI through Value Sensitive Design

Bergman, Abercrombie, Spruit et al. 2022

Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Over the last several years, end-to-end neural conversational agents have vastly improved their ability to carry unrestricted, open-domain conversations with humans. However, these models are often trained on large datasets from the Internet and, as a result, may learn undesirable behaviours from this data, such as toxic or otherwise harmful language. Thus, researchers must wrestle with how and when to release these models. In this paper, we survey recent and related work to highlight tensions between values, potential positive impact, and potential harms. We also provide a framework to support practitioners in deciding whether and how to release these models, following the tenets of value-sensitive design.

“…Offensive system responses: For offensive content generated by the systems themselves, Ram et al (2017) use keyword matching and machine learning methods to detect system responses that are profane, sexual, racially inflammatory, other hate speech, or violent. Zhang et al (2020b) develop a hierarchical classification framework for "malevolent" responses in dialogs (although their data is from Twitter rather than human-agent conversations). And apply the same classifier they used for detection of unsafe user input to system responses, in addition to proposing other methods of avoiding unsafe output (see below).…”

Section: Generating Offensive Content (Instigator Effect)

mentioning

confidence: 99%

Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling

Dinan, Abercrombie, Bergman et al. 2021

Preprint

Warning: this paper contains example data that may be offensive or upsetting. Over the last several years, end-to-end neural conversational agents have vastly improved in their ability to carry a chit-chat conversation with humans. However, these models are often trained on large datasets from the internet, and as a result, may learn undesirable behaviors from this data, such as toxic or otherwise harmful language. Researchers must thus wrestle with the issue of how and when to release these models. In this paper, we survey the problem landscape for safety for end-to-end conversational AI and discuss recent and related work. We highlight tensions between values, potential positive impact and potential harms, and provide a framework for making decisions about whether and how to release these models, following the tenets of value-sensitive design. We additionally provide a suite of tools to enable researchers to make better-informed decisions about training and releasing end-to-end conversational AI models.
