Effective Crowdsourced Generation of Training Data for Chatbots Natural Language Understanding

Bapat, Rucha; Kucherbaev, Pavel; Bozzon, Alessandro

doi:10.1007/978-3-319-91662-0_8

Cited by 12 publications

(8 citation statements)

References 22 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We based our crowdsourcing process upon the generic process for generating chatbot training data introduced by Bapat et al [3]. It comprises three main steps: 1) Preparatory work: clarify use cases for our medication chatbot and create entity-intent model, 2) Creation of orders in the crowdsourcing platform, 3) Collect and control the data.…”

Section: Methodsmentioning

confidence: 99%

Crowdsourcing for Creating a Dataset for Training a Medication Chatbot

Zgraggen

Kunz

Denecke

2021

Studies in Health Technology and Informatics

View full text Add to dashboard Cite

To facilitate interaction with mobile health applications, chatbots are increasingly used. They realize the interaction as a dialog where users can ask questions and get answers from the chatbot. A big challenge is to create a comprehensive knowledge base comprising patterns and rules for representing possible user queries the chatbot has to understand and interpret. In this work, we assess how crowdsourcing can be used for generating examples of possible user queries for a medication chatbot. Within one week, the crowdworker generated 2‘738 user questions. The examples provide a large variety of possible formulations and information needs. As a next step, these examples for user queries will be used to train our medication chatbot.

show abstract

Section: Methodsmentioning

confidence: 99%

Crowdsourcing for Creating a Dataset for Training a Medication Chatbot

Zgraggen

Kunz

Denecke

2021

Studies in Health Technology and Informatics

View full text Add to dashboard Cite

show abstract

“…Moreover, capitalization and article errors seems abundant (e.g., Example 5 in Table 1). Given that real bot users also make such errors, it is important to have linguistically incorrect utterances in the training samples (Bapat et al, 2018). However, at a very least, detecting linguistic errors can contribute to quality-aware selection of crowd workers.…”

Section: Linguistic Errorsmentioning

confidence: 99%

“…In some cases, workers misunderstood the task and provided translations in their own native languages (referred to as Translation issues) (Crossley et al, 2016;Braunger et al, 2018;Bapat et al, 2018), and some mistakenly thought they should provide answers for expressions phrased as questions (referred to as Answering issues) such as Example 9 in Table 1. This occurred even though workers were provided with comprehensive instructions and examples.…”

Section: Task Misunderstandingmentioning

confidence: 99%

A Study of Incorrect Paraphrases in Crowdsourced User Utterances

Yaghoub-Zadeh-Fard

Benatallah

Barukh

et al. 2019

Proceedings of the 2019 Conference of the North

View full text Add to dashboard Cite

Developing bots demands high quality training samples, typically in the form of user utterances and their associated intents. Given the fuzzy nature of human language, such datasets ideally must cover all possible utterances of each single intent. Crowdsourcing has widely been used to collect such inclusive datasets by paraphrasing an initial utterance. However, the quality of this approach often suffers from various issues, particularly language errors produced by unqualified crowd workers. More so, since workers are tasked to write open-ended text, it is very challenging to automatically asses the quality of paraphrased utterances. In this paper, we investigate common crowdsourced paraphrasing issues, and propose an annotated dataset called Para-Quality, for detecting the quality issues. We also investigate existing tools and services to provide baselines for detecting each category of issues. In all, this work presents a data-driven view of incorrect paraphrases during the bot development process, and we pave the way towards automatic detection of unqualified paraphrases.

show abstract

“…Bapat et al [2] already presented an end-to-end pipeline for simplifying the NLU training process, where the first sentences are defined and extended for the following training. While the extension of the training dataset is skipped and only classified into 5 categories of possible extension methods, our approach mainly targets the class of generating big pools of parameter values.…”

Section: Related Workmentioning

confidence: 99%

Improving NLU Training over Linked Data with Placeholder Concepts

Schmitt

Kulbach

Sure-Vetter

2019

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Conversational systems, also known as dialogue systems, have become increasingly popular. They can perform a variety of tasks e.g. in B2C areas such as sales and customer services. A significant amount of research has already been conducted on improving the underlying algorithms of the natural language understanding (NLU) component of dialogue systems. This paper presents an approach to generate training datasets for the NLU component from Linked Data resources. We analyze how differently designed training datasets can impact the performance of the NLU component. Whereby, the training datasets differ mainly by varying values for the injection into fixed sentence patterns. As a core contribution, we introduce and evaluate the performance of different placeholder concepts. Our results show that a trained model with placeholder concepts is capable of handling dynamic Linked Data without retraining the NLU component. Thus, our approach also contributes to the robustness of the NLU component.

show abstract

Effective Crowdsourced Generation of Training Data for Chatbots Natural Language Understanding

Cited by 12 publications

References 22 publications

Crowdsourcing for Creating a Dataset for Training a Medication Chatbot

Crowdsourcing for Creating a Dataset for Training a Medication Chatbot

A Study of Incorrect Paraphrases in Crowdsourced User Utterances

Improving NLU Training over Linked Data with Placeholder Concepts

Contact Info

Product

Resources

About