2022
DOI: 10.1145/3485766

A Survey of Evaluation Metrics Used for NLG Systems

Abstract: In the last few years, a large number of automatic evaluation metrics have been proposed for evaluating Natural Language Generation (NLG) systems. The rapid development and adoption of such automatic evaluation metrics in a relatively short time has created the need for a survey of these metrics. In this survey, we (i) highlight the challenges in automatically evaluating NLG systems, (ii) propose a coherent taxonomy for organising existing evaluation metrics, (iii) briefly describe different existing metrics, …

Cited by 72 publications (44 citation statements)
References 111 publications
“…For the evaluation, we developed an automated framework and utilized both automatic and human-based rankings. We used popular metrics, such as BLEU [25] and ROUGE [26,27], for automatic evaluation. These metrics are widely used for Natural Language Generation (NLG) tasks, including AQG, as they calculate the n-gram similarity between the reference sentence and the generated questions.…”
Section: Experimental Analysis and Evaluation Results
Citation type: mentioning
confidence: 99%
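The excerpt above notes that BLEU and ROUGE score the n-gram similarity between a reference sentence and a generated question. The following is a minimal sketch of how such scores might be computed with the nltk and rouge_score packages; the example sentences and the smoothing choice are illustrative assumptions, not taken from the cited work.

```python
# Illustrative sketch: n-gram overlap metrics (BLEU, ROUGE) for a generated
# question versus a reference, using nltk and the rouge_score package.
# The sentences and smoothing choice below are assumptions for demonstration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "what metrics are used to evaluate natural language generation"
candidate = "which metrics evaluate natural language generation systems"

# BLEU: modified n-gram precision (up to 4-grams) with a brevity penalty;
# smoothing avoids zero scores on short single-sentence comparisons.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```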
“…Human assessments provide a complete picture of response generating performance, particularly in the generation of humanlike response [60]. We leverage the human evaluation questions from [8], which covers engaging, interesting, humanlike, and knowledgeable, for response generation assessments.…”
Section: Human Evaluations
Citation type: mentioning
confidence: 99%
“…A total of 18 automatic metrics are tested against statistics produced by the human judgements of our criteria: post-edit times, number of incorrect statements, and number of omissions. Following the taxonomies reported by Celikyilmaz et al. (2020) and Sai et al. (2020), the metrics considered can be loosely grouped in:…”
Section: Correlation With Automatic Metrics
Citation type: mentioning
confidence: 99%
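The excerpt above describes testing automatic metrics against statistics derived from human judgements (post-edit times, counts of incorrect statements and omissions). A hedged sketch of the usual correlation analysis follows, assuming per-output metric scores and human statistics are available as parallel lists; the variable names and values are illustrative only, not data from the cited study.

```python
# Illustrative sketch: correlating an automatic metric with a human-derived
# statistic (e.g., post-edit time) across system outputs.
# The numbers below are placeholder values purely to show the computation.
from scipy.stats import pearsonr, spearmanr

metric_scores   = [0.62, 0.48, 0.71, 0.55, 0.40]   # one automatic metric, per output
post_edit_times = [34.0, 51.5, 22.0, 40.0, 63.5]   # seconds of human post-editing

# Pearson measures linear association; Spearman only assumes a monotonic
# relationship, which is often more appropriate for ranking-style metrics.
pearson_r, pearson_p = pearsonr(metric_scores, post_edit_times)
spearman_rho, spearman_p = spearmanr(metric_scores, post_edit_times)

print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.2f} (p = {spearman_p:.3f})")
```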