2020
DOI: 10.1007/978-981-15-8395-7_5

Deep AM-FM: Toolkit for Automatic Dialogue Evaluation


Cited by 27 publications (32 citation statements)
References 14 publications
“…However, the distribution of the number of constraint tokens in the experiments of these papers was much smaller than that of this task, and we found these methods did not perform well on this task. and Chen et al. (2021) proposed lexically constrained decoding given explicit alignment guidance between the constraints and the source text. Alignments were induced from an additional alignment head or attention weights (Garg et al., 2019), but these methods assumed that gold alignments are given as constraints.…”
Section: Discussion (mentioning)
confidence: 99%
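For context on the lexically constrained decoding discussed in the excerpt above, the sketch below shows generic constrained beam search with Hugging Face Transformers' force_words_ids argument. The checkpoint, source sentence, and constraint word are illustrative assumptions, and this is not the alignment-guided variant the excerpt describes.

```python
# A minimal sketch of lexically constrained decoding via constrained beam search
# in Hugging Face Transformers. The checkpoint, source sentence, and constraint
# word are illustrative assumptions; this is generic constrained beam search,
# not the alignment-guided method described in the excerpt above.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

source = "translate English to German: The cat sat on the mat."
inputs = tokenizer(source, return_tensors="pt")

# Token sequences that must appear somewhere in the generated output.
force_words_ids = tokenizer(["Katze"], add_special_tokens=False).input_ids

outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,
    num_beams=5,          # constrained decoding requires beam search
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```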
“…Semantic appropriateness seeks to determine whether the response topically fits into its corresponding dialogue context. Many existing evaluation frameworks have focused on the two above-mentioned aspects [18]-[20]. Context coherence is concerned with the structure of the dialogue flow, such that the passage is expressed fluidly, clearly, and with sensible turn-taking [21].…”
Section: A. Aspects of Evaluation (mentioning)
confidence: 99%
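As a rough illustration of the semantic-appropriateness aspect mentioned above (and not the method of any cited framework), one common proxy is the cosine similarity between embeddings of the dialogue context and the candidate response; the sentence-transformers checkpoint and example strings below are assumptions.

```python
# A minimal sketch (not any cited framework): approximate semantic appropriateness
# as cosine similarity between embeddings of the dialogue context and the response.
# The sentence-transformers checkpoint and example strings are assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

context = "A: I just adopted a puppy. B: That's great! What breed is it?"
response = "She's a golden retriever and full of energy."

embeddings = encoder.encode([context, response], convert_to_tensor=True)
appropriateness = util.cos_sim(embeddings[0], embeddings[1]).item()  # higher = more on-topic
print(f"semantic appropriateness proxy: {appropriateness:.3f}")
```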
“…For evaluation datasets from DSTC shared tasks, we compare D-score against reference-based baselines adopted by the task organizers. On the DSTC6 evaluation dataset, we also compare D-score against two more recently proposed reference-based metrics, AM-FM [19] and Deep AM-FM [18]. In addition, four state-of-the-art reference-free baselines, BERT NLI (BNLI) [11], contextualized RUBER (Ctr-R) [7], GPT-2 [20] and USR [8], are included.…”
Section: B. Baselines (mentioning)
confidence: 99%
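To illustrate the reference-based AM-FM idea the excerpt refers to (adequacy from semantic similarity to a reference, fluency from a language model), here is a hedged sketch; the checkpoints, the length-normalised fluency score, and the equal weighting are assumptions, not the authors' released toolkit.

```python
# A hedged sketch of an AM-FM-style reference-based score, not the authors' toolkit:
# AM (adequacy) = embedding cosine similarity between hypothesis and reference,
# FM (fluency)  = length-normalised language-model probability of the hypothesis.
# Checkpoints and the 0.5/0.5 weighting are illustrative assumptions.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

encoder = SentenceTransformer("all-MiniLM-L6-v2")
lm_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def adequacy(hypothesis: str, reference: str) -> float:
    emb = encoder.encode([hypothesis, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def fluency(hypothesis: str) -> float:
    ids = lm_tokenizer(hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss      # mean per-token negative log-likelihood
    return torch.exp(-loss).item()           # geometric-mean per-token probability

def am_fm(hypothesis: str, reference: str, alpha: float = 0.5) -> float:
    return alpha * adequacy(hypothesis, reference) + (1.0 - alpha) * fluency(hypothesis)

print(am_fm("Sure, the meeting is at 3 pm.", "Yes, we meet at three o'clock."))
```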
“…For evaluation, as we have mentioned before, we use BLEU (Papineni et al., 2002) as the primary evaluation metric. WAT also uses metrics such as RIBES (Isozaki et al., 2010), AM-FM (Zhang et al., 2021) and human evaluation (Nakazawa et al., 2019, 2020, 2021). All these metrics focus on different aspects of translations and may lead to different rankings for submissions; however, this multi-metric evaluation helps us understand that there may not be one perfect model.…”
Section: Training and Evaluation (mentioning)
confidence: 99%
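For reference, a small sketch of computing corpus-level BLEU with the sacrebleu package; the hypothesis and reference strings are made-up examples, and RIBES and AM-FM would need their own implementations.

```python
# A minimal sketch of corpus-level BLEU with the sacrebleu package; the hypothesis
# and reference strings are made-up examples, and RIBES / AM-FM are not shown here.
import sacrebleu

hypotheses = [
    "The cabinet approved the new budget on Tuesday.",
    "She said the results were encouraging.",
]
references = [[
    "The cabinet approved a new budget on Tuesday.",
    "She said that the results were encouraging.",
]]  # outer list = reference sets; add more inner lists for multi-reference BLEU

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```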