2021
DOI: 10.48550/arxiv.2106.05555
Preprint

Shades of BLEU, Flavours of Success: The Case of MultiWOZ

Abstract: The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark, such as unsatisfactory preprocessing, insufficient or underspecified evaluation metrics, or a rigid database. We re-evaluate 7 end-to-end …
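The abstract's point about inconsistent BLEU reporting is easy to see in practice. Below is a minimal sketch, assuming the `sacrebleu` package is installed; it shows how a single preprocessing choice (lowercasing) changes the reported corpus BLEU, which is one flavour of the discrepancy the paper documents. The example strings are illustrative toy data, not taken from MultiWOZ.

```python
# Minimal sketch: how one preprocessing choice shifts corpus BLEU.
# Assumes the `sacrebleu` package; the toy strings are illustrative only.
import sacrebleu

hypotheses = [
    "there are 4 hotels in the centre .",
    "the phone number is 01223 356354 .",
]
# One reference stream, parallel to the hypotheses.
references = [[
    "There are 4 hotels in the centre.",
    "Their phone number is 01223 356354.",
]]

# Default sacrebleu settings (13a tokenisation, case-sensitive).
cased = sacrebleu.corpus_bleu(hypotheses, references)

# Lowercasing alone already changes the number that gets reported.
uncased = sacrebleu.corpus_bleu(hypotheses, references, lowercase=True)

print(f"case-sensitive BLEU: {cased.score:.2f}")
print(f"lowercased BLEU:     {uncased.score:.2f}")
```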

Cited by 2 publications (3 citation statements)
References 9 publications

Citation statements
“…We evaluated the performance of PEFTTOD in the context of task-oriented dialogue systems for end-to-end dialogue modeling [45]. The evaluation was conducted using the benchmark dataset Multi-WOZ 2.0 [12].…”
Section: Discussion (mentioning)
confidence: 99%
“…PEFTTOD was trained on the Multi-WOZ 2.0 dataset, specifically on the task of the end-to-end dialogue modeling [45]. The proposed system was trained using the maximum likelihood method, a common approach in machine learning, which aims to optimize the model's parameters by maximizing the likelihood of generating the correct outputs given the inputs.…”
Section: End-to-end Dialogue Modeling (mentioning)
confidence: 99%
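The maximum-likelihood training this statement describes is standard teacher forcing with a cross-entropy loss over the gold response tokens. A minimal PyTorch sketch follows; `model`, `batch`, and `optimizer` are hypothetical placeholders, since the quote does not specify PEFTTOD's actual interfaces.

```python
# Minimal sketch of the maximum-likelihood objective described above.
# `model`, `batch`, and `optimizer` are hypothetical placeholders.
import torch.nn.functional as F

def mle_step(model, batch, optimizer):
    """One gradient step that maximises the likelihood of the gold response."""
    # Assumed: model maps input token ids to (batch, seq_len, vocab) logits,
    # and batch["labels"] holds gold token ids with -100 on padding positions.
    logits = model(batch["input_ids"])
    nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (batch*seq_len, vocab)
        batch["labels"].view(-1),          # flatten gold ids
        ignore_index=-100,                 # skip padding
    )
    nll.backward()        # minimising NLL == maximising likelihood
    optimizer.step()
    optimizer.zero_grad()
    return nll.item()
```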
“…Despite the availability of toolkits that facilitate user simulation (US) evaluation (Zhu et al., 2020), corpus-based match and success rates are the default benchmark for works in task-oriented dialogue systems today (Budzianowski et al., 2018; Nekvinda and Dušek, 2021). These metrics are practical to compute, reproducible, and scalable.…”
Section: Related Work (mentioning)
confidence: 99%
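As a rough illustration of what the corpus-based match (Inform) and Success rates in this statement measure, here is a toy sketch. The function and slot names are hypothetical and heavily simplified; the official MultiWOZ evaluation scripts additionally perform database matching and delexicalisation that this sketch omits.

```python
# Toy sketch of a corpus-based Inform/Success-style check; entity and slot
# names are illustrative, not the official MultiWOZ evaluation script.
def inform_success(offered_entities, goal_entities, requested, provided):
    """Return (inform, success) for one dialogue.

    inform:  did the system offer an entity matching the user's goal?
    success: inform AND were all requested slots (phone, address, ...) given?
    """
    inform = any(e in goal_entities for e in offered_entities)
    success = inform and all(slot in provided for slot in requested)
    return inform, success

# Example: a matching hotel was offered, but the phone number never given.
inform, success = inform_success(
    offered_entities={"acorn guest house"},
    goal_entities={"acorn guest house", "alexander b&b"},
    requested={"phone", "address"},
    provided={"address"},
)
print(inform, success)  # True False
```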