2020
DOI: 10.48550/arxiv.2010.02140
Preprint

Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

Abstract: The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations requiring humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether it is human or not.
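As a rough illustration of the annotation-aggregation step the abstract describes (this is not the paper's actual implementation; the record format, field names, and scoring rule below are assumptions made for the sketch), one could tally, for each bot, how often judges mistook it for a human in bot-bot conversations:

```python
from collections import defaultdict

# Hypothetical annotation records: each entry says which bot produced one
# side of a bot-bot conversation and whether a human judge labeled that
# entity as "human". The field names are assumptions for this sketch.
annotations = [
    {"bot": "bot_a", "judged_human": True},
    {"bot": "bot_a", "judged_human": False},
    {"bot": "bot_b", "judged_human": False},
    {"bot": "bot_b", "judged_human": False},
]

def rank_bots(annotations):
    """Rank bots by the fraction of judgments in which they passed as human."""
    wins = defaultdict(int)   # times a bot was judged human
    total = defaultdict(int)  # total judgments per bot
    for a in annotations:
        total[a["bot"]] += 1
        wins[a["bot"]] += a["judged_human"]
    # A higher "pass rate" means judges more often failed to spot the bot.
    return sorted(((wins[b] / total[b], b) for b in total), reverse=True)

for rate, bot in rank_bots(annotations):
    print(f"{bot}: judged human in {rate:.0%} of annotations")
```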

Cited by 1 publication (5 citation statements)
References 9 publications
“…Similarly, as shown in figure 4, the model was able to meet or exceed human performance on "Spot the Bot" [5] Turing evaluations for 67.16% of the validation set. In fact, 3 response data points even coded the human responses as most likely generated, and in two of these the model achieved a higher score.…”
Section: Results and Validation
Citation type: mentioning
Confidence: 54%
“…After training, the model was evaluated on variants of two sets of published metrics. Single-blinded, independently coded responses show that the model was able to synthesize utterances at or above human level 59.7% and 67.16% of the time for the RQI (response quality index) [15] and a comparative Turing test ("Spot the Bot" [5]), respectively, on a test set of 134 validation prompt/response pairs. Of the samples where human responses did outperform synthesized responses, only 17.9% and 20.9% of the subset did so significantly.…”
Section: Conclusion and Discussion
Citation type: mentioning
Confidence: 99%