2018
DOI: 10.1162/coli_a_00322

A Structured Review of the Validity of BLEU

Abstract: The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique (in other words, whether BLEU scores correlate with real-world utility and user satisfaction of NLP systems); this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it …

Cited by 244 publications (173 citation statements)
References 12 publications
“…Finally, it must be noted that the results using automatic metrics are quite different from results obtained in human evaluation (see Section 8.4), which confirms previous findings (Novikova et al., 2017a; Reiter, 2018). Table 9 summarises results from a range of textual metrics which aim to assess the complexity and diversity of primary system outputs (cf.…”
Section: Word-overlap Metrics
Citation type: supporting
confidence: 76%
“…However, sole use of automatic metrics is only sensible if they are known to be sufficiently correlated with human preferences. Recent studies (Novikova et al., 2017a; Reiter, 2018) have demonstrated that this is very often not the case and that automatic metrics only weakly reflect human judgements on system outputs as generated by data-driven NLG. Therefore, we also performed a large-scale crowdsourced human evaluation, as detailed in Section 7.2.…”
Section: Evaluation Setup
Citation type: mentioning
confidence: 99%
“…Its weaknesses abound, and much has been written about them (cf. Callison-Burch et al. (2006); Reiter (2018)). This paper is not, however, concerned with the shortcomings of BLEU as a proxy for human evaluation of quality; instead, our goal is to bring attention to the relatively narrower problem of the reporting of BLEU scores.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…the quality of text that has been machine-translated (Papineni et al., 2002; Reiter, 2018). However, it is difficult to build a loss function based on this evaluation criterion since it is not differentiable.…”
Citation type: mentioning
confidence: 99%