2020
DOI: 10.1162/tacl_a_00347

Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining

Abstract: There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected respons…
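
To make the training setup concrete, here is a minimal sketch of how a BERT-style classifier can score a (context, response) pair for relevance. This is illustrative only, not the paper's DEB model: the checkpoint name, the label convention (index 1 = "relevant"), and the example texts are assumptions, and a freshly initialized classification head would need to be trained on relevant and irrelevant responses, as the abstract describes, before its scores mean anything.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Hypothetical 2-way head; assume label index 1 means "relevant response".
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.eval()

context = "How was your weekend?"
response = "It was great, I went hiking with friends."

# Encode as BERT's two-segment input: [CLS] context [SEP] response [SEP]
inputs = tokenizer(context, response, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
relevance = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"relevance score: {relevance:.3f}")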

Cited by 28 publications (28 citation statements)
References 18 publications

“…In terms of various dialog understanding tasks, our models achieve state-of-the-art performances in several tasks (absolute improvements up to 8.6% in task accuracies) and perform consistently well across a variety of dialog understanding tasks at all scales, whereas baseline models usually have a rather imbalanced performance across tasks. Our models show the most promising performance on the DailyDialog++ (Sai et al., 2020) dialog evaluation task.…”
Section: Introduction (mentioning)
confidence: 91%
“…ConveRT (Henderson et al., 2020) is trained on the response retrieval task using Reddit threads. DEB, or Dialog Evaluation using BERT (Sai et al., 2020), is a model based on extended pretraining of the BERT architecture using Reddit data. DialogRPT (Gao et al., 2020), on the other hand, is pretrained to predict human feedback (e.g., upvotes and downvotes) on comments to Reddit threads.…”
Section: Dialog System Pretraining (mentioning)
confidence: 99%
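Microsoft has released DialogRPT checkpoints on the Hugging Face Hub; a minimal usage sketch follows, assuming the microsoft/DialogRPT-updown checkpoint and its documented input format (context and response joined by the "<|endoftext|>" token). Treat these details as assumptions rather than a definitive recipe.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint name; this head predicts human feedback (upvotes).
name = "microsoft/DialogRPT-updown"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

def score(context: str, response: str) -> float:
    """Probability-like score that the response would be upvoted."""
    ids = tokenizer.encode(context + "<|endoftext|>" + response,
                           return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits
    return torch.sigmoid(logits).item()

print(score("How was your weekend?",
            "It was great, I went hiking with friends."))
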
“…The conversational dataset has a variety of uses, e.g., improving dialogue evaluation [2], emotion recognition [3], or serving as a dataset for studying the personality and demographics of people on the Internet [4].…”
Section: Related Work (mentioning)
confidence: 99%