Human versus automatic quality evaluation of NMT and PBSMT (2018)
DOI: 10.1007/s10590-018-9220-z

Cited by 34 publications (22 citation statements)
References 19 publications
“…As I note in Way (2018), it simply cannot be the case that a 2-point improvement in BLEU score (almost an irrelevance in a real industrial translation use-case), which was typically what was seen at WMT-2016, where NMT systems swept the board on all tasks and language pairs, can be reflective of the improvements in word order and lexical selection noted by Bentivogli et al. (2016). Note that Shterionov et al. (2018) actually computed the degree of quality underestimation for three popular automatic evaluation metrics (BLEU, METEOR and TER), showing that for NMT this may be up to 50%.…”
Section: III.2 Does MT evaluation need to change with NMT coming onstream?
confidence: 99%
“…The former is typically achieved by comparing an MT output to a reference translation and generating metrics such as BLEU [7], whereas the latter concerns the human assessment of an MT output, such as the annotations of errors or the ranking of translations. Although manual assessment is much more time-consuming, its automatic variant has often been criticized as being problematic [8], having the tendency to underestimate the quality of NMT systems [9], or not even being a representation of the actual quality of the translation [1]. Sceptical of the reliability of automatic assessment, Matusov [1] performed a manual assessment of literary NMT with his own classification system, specifically designed for NMT.…”
Section: Related Research
confidence: 99%
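The passage above describes the mechanics of automatic evaluation: an MT output is compared against a reference translation and a score such as BLEU is produced. The following is a minimal sketch of that workflow, assuming the sacrebleu package (with TER support in recent versions) and using placeholder sentences rather than data from the cited studies:

```python
# Minimal sketch of automatic MT evaluation: score system outputs against
# reference translations with corpus-level metrics (here BLEU and TER).
# Assumes a recent sacrebleu release; the sentences are illustrative placeholders.
import sacrebleu

hypotheses = [
    "the cat sat on the mat",
    "he read the book on the train",
]
references = [
    "the cat sat on the mat",
    "he was reading the book on the train",
]

# sacrebleu expects a list of hypothesis strings and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])

print(f"BLEU: {bleu.score:.2f}")
print(f"TER:  {ter.score:.2f}")
```

Human assessment, by contrast, would require annotators to rate, rank, or annotate errors in each of these outputs directly, which is why the quoted passage describes it as far more time-consuming.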
“…Wu et al. 2016; Junczys-Dowmunt et al. 2016; Crego et al. 2016; Castilho et al. 2017). However, recent research suggests that automatic metrics may not always be suitable: Shterionov et al. (2018) compared automatic evaluation scores to human evaluation and noticed that the automatic scores underestimated the quality of NMT systems. Other studies, such as Castilho et al. (2017), report mixed results using different automatic (HTER, BLEU) and human evaluation metrics (fluency, adequacy), with SMT systems outperforming NMT in two of the three case studies, and point out that results vary depending on domain and language pair.…”
Section: NMT Output and Quality
confidence: 99%
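One common way to carry out the kind of comparison described above, between automatic scores and human evaluation, is to correlate segment-level metric scores with human judgements. The sketch below is purely illustrative: it assumes sacrebleu and scipy, and the sentences and adequacy ratings are invented placeholders, not the procedure or data of Shterionov et al. (2018) or Castilho et al. (2017):

```python
# Illustrative sketch: correlate segment-level automatic scores with human
# adequacy judgements. The sentences and ratings are invented placeholders,
# not data from the studies quoted above.
import sacrebleu
from scipy.stats import pearsonr

hypotheses = [
    "the contract was signed yesterday",
    "he gave the report to his boss",
    "the weather will improves tomorrow",
    "she have finished the translation",
]
references = [
    "the contract was signed yesterday",
    "he handed the report to his manager",
    "the weather will improve tomorrow",
    "she has finished the translation",
]
human_adequacy = [4.0, 3.0, 2.5, 3.0]  # e.g. a 1-4 adequacy scale (placeholder values)

# Segment-level BLEU for each hypothesis/reference pair
segment_bleu = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]

# Pearson correlation between the automatic metric and the human judgements
r, p = pearsonr(segment_bleu, human_adequacy)
print(f"Pearson r (sentence BLEU vs. adequacy): {r:.2f}, p = {p:.3f}")
```

A low or unstable correlation in this kind of check is what motivates the claim, in the quoted passages, that automatic metrics can underestimate NMT quality relative to human evaluation.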