Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1169

Unifying Human and Statistical Evaluation for Natural Language Generation

Abstract: How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, b…
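The abstract contrasts human evaluation (captures quality) with statistical evaluation such as perplexity (captures diversity) and proposes a framework unifying the two. As a rough illustration of the kind of classifier-based estimate such a framework can build on, the sketch below measures how separable human-written and model-generated samples are given two per-sample features, a model log-probability and a human quality judgment. The leave-one-out nearest-neighbour classifier, the feature choice, and the toy data are assumptions made here for illustration, not the paper's exact procedure.

```python
# Minimal sketch (not the authors' code): estimate how well human-written and
# model-generated samples can be told apart from two per-sample features,
# a statistical score (model log-probability) and a human quality judgment.
# The 1-nearest-neighbour leave-one-out error and the toy data are assumptions.
import numpy as np

def leave_one_out_error(features: np.ndarray, labels: np.ndarray) -> float:
    """Leave-one-out 1-nearest-neighbour classification error.

    features: (n, d) array, one row per sample, e.g. [log p_model, human score].
    labels:   (n,) array, 1 for human-written, 0 for model-generated.
    """
    n = len(labels)
    errors = 0
    for i in range(n):
        # Distance from sample i to every other sample.
        dists = np.linalg.norm(features - features[i], axis=1)
        dists[i] = np.inf  # exclude the held-out point itself
        nearest = int(np.argmin(dists))
        errors += int(labels[nearest] != labels[i])
    return errors / n

# Toy usage: 2-D features = (model log-probability, human rating in [0, 1]).
rng = np.random.default_rng(0)
human = np.column_stack([rng.normal(-40, 5, 50), rng.uniform(0.7, 1.0, 50)])
model = np.column_stack([rng.normal(-35, 5, 50), rng.uniform(0.4, 1.0, 50)])
X = np.vstack([human, model])
y = np.array([1] * 50 + [0] * 50)
print("estimated classification error:", leave_one_out_error(X, y))
```

A high error means the classifier cannot separate the two populations, i.e. the system's outputs look human-like on both features; a low error exposes either a quality gap or a diversity gap.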


Cited by 131 publications (96 citation statements)
References: 37 publications
“…More recently, Zhang et al. (2020) leverage large pretrained language models (BERT; Devlin et al., 2019) to relax the limitation of exact n-gram overlap. Hashimoto et al. (2019) combine human judgement with system-reported likelihood of generated text to make population-level estimates of quality and diversity. However, most existing metrics either evaluate generated text against very few references, or provide only a relative ranking of multiple systems at a population level rather than reliable feedback for each example.…”
Section: Related Work (mentioning)
confidence: 99%
“…Nevertheless, most natural language generation papers evaluate only one decoding algorithm; this is often due to the time and expense required for human evaluation. For example, Fan et al. use top-k sampling (a decoding algorithm in which k governs the quality-diversity trade-off), but only evaluate one value of k. However, evaluating one k gives an incomplete view of the generation system; several researchers have emphasized the importance of evaluating generation systems over the entire quality-diversity spectrum, rather than at a single point on it (Caccia et al., 2018; Hashimoto et al., 2019).…”
Section: Introduction (mentioning)
confidence: 99%
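The passage above describes top-k sampling, where only the k most probable next tokens are kept and renormalized before sampling, so k controls the quality-diversity trade-off. A minimal sketch of that decoding rule follows; the toy vocabulary and probabilities are invented for illustration.

```python
# Hedged sketch of top-k sampling: keep only the k most probable tokens,
# renormalize, and sample. Vocabulary and probabilities are made up.
import numpy as np

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Sample a token index from the k highest-probability tokens."""
    top = np.argsort(probs)[-k:]            # indices of the k most probable tokens
    renorm = probs[top] / probs[top].sum()  # renormalize over the kept tokens
    return int(rng.choice(top, p=renorm))

rng = np.random.default_rng(0)
vocab = ["the", "a", "dog", "cat", "ran", "xylophone"]
probs = np.array([0.30, 0.25, 0.18, 0.15, 0.10, 0.02])

# Small k: conservative, higher-quality but repetitive text.
# Large k (up to the full vocabulary): more diverse but riskier samples.
for k in (1, 3, len(vocab)):
    print(k, vocab[top_k_sample(probs, k, rng)])
```

Evaluating only one k, as the quoted passage notes, shows a single point on this spectrum rather than the whole trade-off curve.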
“…See et al. (2019) find that flaws in language generation can be traced back to the choice of decoding method, rather than model architecture or insufficient training. The choice of decoding method can be seen as a trade-off between diversity and quality (Sun, Schuster & Shmatikov, 2020; Hashimoto, Zhang & Liang, 2019): sampling from the full distribution leads to diverse but poor-quality text as perceived by humans, while a likelihood-maximizing method that generates only from the most probable tokens leads to high-quality text that lacks diversity and is unnaturally repetitive. Holtzman et al. (2019) attribute the problem with sampling from the full distribution to the growing cumulative likelihood of picking an individually highly unlikely token, causing downward spirals in text quality that are easy for human readers to notice.…”
Section: Language Generation (mentioning)
confidence: 99%
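The last quote contrasts sampling from the full distribution (diverse but occasionally low quality) with likelihood-maximizing decoding (high quality but repetitive). One widely used compromise, sketched below, is nucleus (top-p) sampling, which truncates the distribution to the smallest set of tokens covering probability mass p; the specific scheme and the toy distribution are illustrative assumptions here, not details stated in the quoted passage.

```python
# Hedged sketch contrasting full-distribution sampling with a truncated scheme
# (nucleus / top-p sampling), one common way to avoid the unlikely-token
# failure mode described above. The probability vector is made up.
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float, rng: np.random.Generator) -> int:
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]                    # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # size of the kept "nucleus"
    kept = order[:cutoff]
    renorm = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=renorm))

rng = np.random.default_rng(0)
probs = np.array([0.40, 0.30, 0.15, 0.10, 0.04, 0.01])

full = [int(rng.choice(len(probs), p=probs)) for _ in range(10)]    # full distribution
trunc = [nucleus_sample(probs, p=0.9, rng=rng) for _ in range(10)]  # truncated sampling
print("full distribution :", full)   # can pick individually unlikely tokens
print("nucleus (p=0.9)   :", trunc)  # never picks tokens outside the 90% mass
```

Truncating the tail removes the unlikely tokens that trigger the quality spirals the quote describes, at the cost of some diversity, which is exactly the trade-off the cited work argues should be evaluated across its full range.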