Proceedings of the 15th European Workshop on Natural Language Generation (ENLG) 2015
DOI: 10.18653/v1/w15-4708

A Snapshot of NLG Evaluation Practices 2005 - 2014

Abstract: In this paper we present a snapshot of end-to-end NLG system evaluations as presented in conference and journal papers over the last ten years, in order to better understand the nature and types of evaluation that have been undertaken. We find that researchers tend to favour specific evaluation methods, and that their evaluation approaches are also correlated with the publication venue. We further discuss what factors may influence the types of evaluation used for a given NLG system.

Cited by 47 publications (56 citation statements) | References 2 publications

“…This paper shows that state-of-the-art automatic evaluation metrics for NLG systems do not sufficiently reflect human ratings, which stresses the need for human evaluations. This result is opposed to the current trend of relying on automatic evaluation identified in (Gkatzia and Mahamood, 2015).…”
Section: Discussion (contrasting)
confidence: 65%
“…Only three papers (3%) in the sample of INLG and ACL papers presented an extrinsic evaluation. This is a notable decrease from Gkatzia and Mahamood (2015), who found that nearly 25% of studies contained an extrinsic evaluation. Of course, extrinsic evaluation is the most time- and cost-intensive out of all possible evaluations (Gatt and Krahmer, 2018), which might explain the rarity, but does not explain the decline in (relative) frequency.…”
Section: Intrinsic and Extrinsic Evaluation (mentioning)
confidence: 87%
“…Previous studies have also provided overviews of evaluation methods. Gkatzia and Mahamood (2015) focused on NLG papers from 2005-2014; Amidei et al. (2018a) provided a 2013-2018 overview of evaluation in question generation; and Gatt and Krahmer (2018) provided a more general survey of the state of the art in NLG. However, the aim of these papers was to give a structured overview of existing methods, rather than discuss shortcomings and best practices.…”
Section: Introduction (mentioning)
confidence: 99%
“…This includes a novel application of textual measures and a novel usage of standard word-overlap metrics to assess similarity among individual systems. Automatic metrics are popular in NLG (Gkatzia and Mahamood, 2015) because they are cheaper and faster to run than human evaluation. However, sole use of automatic metrics is only sensible if they are known to be sufficiently correlated with human preferences.…”
Section: Evaluation Setup (mentioning)
confidence: 99%
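
The last point in the quotation above refers to checking metric-human correlation before relying on automatic metrics alone. The sketch below is not from the cited papers; it is a minimal illustration with hypothetical system-level scores, showing the usual check via Pearson and Spearman correlations (scipy.stats) between an automatic metric and mean human ratings.

```python
# Minimal sketch (hypothetical data): is an automatic metric sufficiently
# correlated with human preferences to stand in for human evaluation?
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores: one automatic metric value (e.g. BLEU)
# and one mean human rating per NLG system.
bleu_scores   = [0.31, 0.27, 0.42, 0.35, 0.22, 0.38]
human_ratings = [3.4,  3.1,  3.9,  3.0,  2.8,  4.1]

# Pearson measures linear association; Spearman measures rank agreement,
# which is often what matters when metrics are used to rank systems.
pearson_r, pearson_p = pearsonr(bleu_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(bleu_scores, human_ratings)

print(f"Pearson r     = {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman rho  = {spearman_rho:.2f} (p = {spearman_p:.3f})")

# A low or unstable correlation would suggest that the metric alone is not
# a safe substitute for human evaluation on these systems.
```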