A Structured Review of the Validity of BLEU

Reiter, Ehud

doi:10.1162/coli_a_00322

Cited by 244 publications

(173 citation statements)

References 12 publications

Supporting

Mentioning

166

Contrasting

“…Finally, it must be noted that the results using automatic metrics are quite different from results obtained in human evaluation (see Section 8.4), which confirms previous findings (Novikova et al, 2017a; Reiter, 2018). Table 9 summarises results from a range of textual metrics which aim to assess the complexity and diversity of primary system outputs (cf.…”

Section: Word-overlap Metrics (supporting)

confidence: 76%

“…However, sole use of automatic metrics is only sensible if they are known to be sufficiently correlated with human preferences. Recent studies (Novikova et al, 2017a; Reiter, 2018) have demonstrated that this is very often not the case and that automatic metrics only weakly reflect human judgements on system outputs as generated by data-driven NLG. Therefore, we also performed a large-scale crowdsourced human evaluation, as detailed in Section 7.2.…”

Section: Evaluation Setup (mentioning)

confidence: 99%

Evaluating the state-of-the-art of End-to-End Natural Language Generation: The E2E NLG challenge

Dušek

Novikova

Rieser

2020

Computer Speech & Language

174

183

This paper provides a comprehensive analysis of the first shared task on End-to-End Natural Language Generation (NLG) and identifies avenues for future research based on the results. This shared task aimed to assess whether recent end-to-end NLG systems can generate more complex output by learning from datasets containing higher lexical richness, syntactic complexity and diverse discourse phenomena. Introducing novel automatic and human metrics, we compare 62 systems submitted by 17 institutions, covering a wide range of approaches, including machine learning architectures, with the majority implementing sequence-to-sequence models (seq2seq), as well as systems based on grammatical rules and templates. Seq2seq-based systems have demonstrated a great potential for NLG in the challenge. We find that seq2seq systems generally score high in terms of word-overlap metrics and human evaluations of naturalness, with the winning Slug system (Juraska et al., 2018) being seq2seq-based. However, vanilla seq2seq models often fail to correctly express a given meaning representation if they lack a strong semantic control mechanism applied during decoding. Moreover, seq2seq models can be outperformed by hand-engineered systems in terms of overall quality, as well as complexity, length and diversity of outputs. This research has influenced, inspired and motivated a number of recent studies outwith the original competition, which we also summarise as part of this paper.

“…Its weaknesses abound, and much has been written about them (cf. Callison-Burch et al (2006); Reiter (2018)). This paper is not, however, concerned with the shortcomings of BLEU as a proxy for human evaluation of quality; instead, our goal is to bring attention to the relatively narrower problem of the reporting of BLEU scores.…”

Section: Introduction (mentioning)

confidence: 99%

A Call for Clarity in Reporting BLEU Scores

Post

2018

Proceedings of the Third Conference on Machine Translation: Research Papers

1,514

916

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SACREBLEU, to facilitate this.

“…the quality of text that has been machine-translated (Papineni et al, 2002; Reiter, 2018). However, it is difficult to build a loss function based on this evaluation criterion since it is not differentiable.…”

mentioning

confidence: 99%

Ensemble Neural Networks (ENN): A gradient-free stochastic method

et al. 2019

In this study, an efficient stochastic gradient-free method, the ensemble neural networks (ENN), is developed. In the ENN, the optimization process relies on covariance matrices rather than derivatives. The covariance matrices are calculated by the ensemble randomized maximum likelihood algorithm (EnRML), which is an inverse modeling method. The ENN is able to simultaneously provide estimations and perform uncertainty quantification since it is built under the Bayesian framework. The ENN is also robust to small training data size because the ensemble of stochastic realizations essentially enlarges the training dataset. This constitutes a desirable characteristic, especially for real-world engineering applications. In addition, the ENN does not require the calculation of gradients, which enables the use of complicated neuron models and loss functions in neural networks. We experimentally demonstrate benefits of the proposed model, in particular showing that the ENN performs much better than the traditional Bayesian neural networks (BNN). The EnRML in ENN is a substitution of gradient-based optimization algorithms, which means that it can be directly combined with the feed-forward process in other existing (deep) neural networks, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), broadening future applications of the ENN.

… the quality of text that has been machine-translated (Papineni et al., 2002; Reiter, 2018). However, it is difficult to build a loss function based on this evaluation criterion since it is not differentiable. However, this will no longer pose a problem if we can find a gradient-free optimization method. Considering the aforementioned problems, a salient question is: are there any alternatives for the optimization method in a neural network that are able to perform uncertainty analysis and perform well with a small dataset, but do not rely on derivative calculations? These obstacles are encountered in numerous engineering fields, such as petroleum engineering. Uncertainty quantification is critical because underground geological parameters are highly heterogeneous. High-dimension models are always solved based on a small dataset due to the expensive and time-consuming data collection. It is also difficult to identify gradients of a target variable with respect to model parameters because the corresponding physical models are highly nonlinear and too complicated to solve analytically. In response to these problems, the ensemble randomized maximum likelihood algorithm (EnRML) is proposed by Gu and Oliver (2007) in the field of history matching in petroleum engineering. History matching is an inverse modeling method, which adjusts a model of a reservoir until it closely reproduces its past behavior (Oliver et al., 2008; Stordal & Naevdal, 2018). It should be mentioned that the word "ensemble" here indicates a different meaning from that in ensemble averaging (Naftaly et al., 1997). In the former, it means the ensemble of realizations generated from the same model, rather tha...
