2020
DOI: 10.3390/app10030762

Human Annotated Dialogues Dataset for Natural Conversational Agents

Abstract: Conversational agents are gaining huge popularity in industrial applications such as digital assistants, chatbots, and particularly systems for natural language understanding (NLU). However, a major drawback is the unavailability of a common metric to evaluate replies against human judgement for conversational agents. In this paper, we develop a benchmark dataset with human annotations and diverse replies that can be used to develop such a metric for conversational agents. The paper introduces a high-quality…

Cited by 13 publications (12 citation statements)
References 25 publications
“…Lastly, the evaluation datasets should cover responses of a wide quality spectrum. In total, we have adopted six different publicly available dialogue evaluation datasets, each accounting for one dialogue domain, for assessing MDD-Eval: DailyDialog-Eval (Zhao, Lala, and Kawahara 2020), Persona-Eval (Zhao, Lala, and Kawahara 2020), Topical-Eval (Mehri and Eskenazi 2020b), Movie-Eval (Merdivan et al. 2020), Empathetic-Eval (Huang et al. 2020), and Twitter-Eval (Hori and Hori 2017). Detailed statistics of each evaluation dataset are listed in Table 3.…”
Section: Evaluation Datasets
confidence: 99%
“…However, dialogue research heavily relies on the ability to evaluate system performance with automatic dialogue evaluation metrics (ADMs). Common natural language generation (NLG) metrics used in the dialogue system literature, such as BLEU (Papineni et al. 2002) and ROUGE (Lin 2004), are unsuitable for the multi-domain dialogue evaluation task, as they have been shown to correlate poorly with human judgements (Liu et al. 2016) due to the one-to-many context-response mapping in dialogues (Zhao, Zhao, and Eskenazi 2017) as well as the multi-faceted nature of dialogue evaluation (Mehri and Eskenazi 2020b). An alternative solution is to design model-based ADMs that explicitly learn to discriminate dialogue responses of varying quality.…”
Section: Introduction
confidence: 99%
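
The poor-correlation point in the excerpt above can be made concrete with a small sketch. The snippet below is illustrative only and is not taken from any of the cited papers: it uses NLTK's sentence-level BLEU (the context, replies, and smoothing choice are assumptions) to show that a perfectly acceptable dialogue reply sharing few n-grams with the single reference receives a much lower BLEU score, which is exactly the one-to-many failure mode described.

```python
# Illustrative sketch: BLEU penalizes valid dialogue replies that do not
# overlap lexically with the single reference (the one-to-many problem).
# Assumes NLTK is installed (pip install nltk); not code from the cited papers.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

context = "What are you doing tonight?"
reference = "I am staying home and watching a movie .".split()

# Both candidate replies are plausible responses to the context.
candidate_overlap = "I am staying home and reading a book .".split()
candidate_valid = "Probably just grabbing dinner with friends .".split()

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
for name, cand in [("lexical overlap", candidate_overlap),
                   ("valid, no overlap", candidate_valid)]:
    score = sentence_bleu([reference], cand, smoothing_function=smooth)
    print(f"{name}: BLEU = {score:.3f}")
# The second reply is equally acceptable to a human but scores far lower,
# so BLEU ranks it poorly despite its quality.
```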
“…• HUMOD: In the study [111], the unavailability of a common metric to evaluate replies against human judgment is addressed. This study contributes by developing a benchmark dataset with human annotations and diverse responses.…”
Section: RQ4 Which Techniques Are Effective for Reducing the Need for...
confidence: 99%
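
A human-annotated benchmark of this kind is typically used to validate an automatic metric by checking how well its scores rank-correlate with the human ratings. A minimal sketch of that procedure follows, assuming SciPy's spearmanr; the scores are made up for illustration and nothing here is taken from the HUMOD paper itself.

```python
# Hypothetical sketch: validating an automatic metric against a
# human-annotated dialogue benchmark via rank correlation.
# All data values below are invented; only the procedure is illustrative.
from scipy.stats import spearmanr

# Human relevance ratings for candidate replies (e.g., averaged 1-5 scale).
human_scores = [4.6, 1.2, 3.8, 2.0, 4.9, 1.5]
# Scores from some automatic metric on the same replies (hypothetical).
metric_scores = [0.71, 0.30, 0.55, 0.42, 0.80, 0.25]

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
# A metric that tracks human judgment well should show a high, significant
# positive rank correlation on such a benchmark.
```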
“…To fill the above gap, we curate a novel large-scale silver dialogue dataset, EDOS (Emotional Dialogues in OpenSubtitles), containing 1M emotional dialogues from movie subtitles, in which each dialogue turn is automatically annotated with 32 fine-grained emotions, eight plus categories, as well as the Neutral category. Movie subtitles are extensively used for emotion analysis in text in earlier and recent research (Kayhani et al., 2020; Merdivan et al., 2020; Giannakopoulos et al., 2009). The Nature article “How movies mirror our mimicry” (Ball, 2011) states that “screenwriters mine everyday discourse to make dialogues appear authentic” and that “audiences use language devices in movies to shape their own discourse”.…”
Section: Speaker
confidence: 99%