Achieving Accurate Conclusions in Evaluation of Automatic Machine Translation Metrics

Graham, Yvette; Li, Qun

doi:10.18653/v1/n16-1001

Cited by 15 publications

(16 citation statements)

References 6 publications


“…Our analysis also shows that the existing metrics both theoretically and empirically differ from each other with significant differences. Compared to the recent results of significance testing of machine translation and summarization metrics (Graham and Baldwin, 2014; Graham and Liu, 2016), our results suggest that there remains much room for improvement in developing more effective image captioning evaluation metrics. We leave this for future work, but a very naive idea would be combining different metrics into a unified metric and we simply test this idea using score combination, after normalizing the score of each metric to the range [0, 1].…”

Section: Discussion (contrasting)

confidence: 72%
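The score combination mentioned in the statement above can be sketched as follows. This is a minimal illustration, not the citing authors' implementation: the statement does not say how scores are normalized or combined, so min-max rescaling and an unweighted mean are assumptions, and the function names and toy scores are hypothetical.

```python
from statistics import mean

def min_max_normalize(scores):
    """Rescale a list of scores linearly to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: there is no spread to rescale, so map to 0.0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combine_metrics(metric_scores):
    """Average per-caption scores across metrics after normalizing each metric.

    Assumption: an unweighted mean is used as the combination rule.
    """
    normalized = [min_max_normalize(scores) for scores in metric_scores.values()]
    # zip(*...) groups the normalized scores that belong to the same caption.
    return [mean(per_caption) for per_caption in zip(*normalized)]

# Hypothetical scores for three captions under two metrics on different scales.
example = {"metric_a": [1.0, 2.0, 3.0], "metric_b": [10.0, 30.0, 20.0]}
combined = combine_metrics(example)
```

Normalizing first keeps a metric with a wide numeric range from dominating the combined score.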

Re-evaluating Automatic Metrics for Image Captioning

Kilickaya, Erdem, Ikizler-Cinbis et al. 2017

Proceedings of the 15th Conference of the European Chapter of The Association for Computational Linguistics: Volume 1



The task of generating natural language descriptions from images has received a lot of attention in recent years. Consequently, it is becoming increasingly important to evaluate such image captioning approaches in an automatic manner. In this paper, we provide an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments. Moreover, we explore the utilization of the recently proposed Word Mover's Distance (WMD) document metric for the purpose of image captioning. Our findings outline the differences and/or similarities between metrics and their relative robustness by means of extensive correlation, accuracy and distraction based evaluations. Our results also demonstrate that WMD provides strong advantages over other metrics.


“…To strengthen the conclusions of our evaluation, we include significance test results for large hybrid-super-samples of systems (Graham and Liu, 2016). 10K hybrid systems were created per language pair, with corresponding DA human assessment scores by sampling pairs of systems from WMT18 News Translation Task, creating hybrid systems by randomly selecting each candidate translation from one of the two selected systems.…”

Section: System-level Results (mentioning)

confidence: 99%

“…Hybrid Systems are created automatically with the aim of providing a larger set of systems against which to evaluate metrics, as in Graham and Liu (2016). Hybrid systems were created for newstest2018 by randomly selecting a pair of MT systems from all systems taking part in that language pair and producing a single output document by randomly selecting sentences from either of the two systems.…”

Section: Translation Systems (mentioning)

confidence: 99%
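The hybrid-system construction described in the two statements above can be sketched as follows. This is a minimal illustration under stated assumptions, not the shared task's code: each system is represented as a list of (translation, human_score) pairs aligned by segment, the hybrid's human score is assumed to be the mean of the scores of its selected segments, and all names and toy data are hypothetical.

```python
import random
from statistics import mean

def make_hybrid(system_a, system_b, rng):
    """Build one hybrid system by drawing each segment from one of two systems.

    Each system is a list of (translation, human_score) tuples, aligned by segment.
    """
    if len(system_a) != len(system_b):
        raise ValueError("systems must cover the same segments")
    # For every segment, pick the candidate from system_a or system_b at random.
    segments = [rng.choice(pair) for pair in zip(system_a, system_b)]
    # Assumption: the hybrid's human score is the mean score of its chosen segments.
    human_score = mean(score for _, score in segments)
    return segments, human_score

def sample_hybrids(systems, n_hybrids, seed=0):
    """Create n_hybrids hybrid systems from randomly selected pairs of systems."""
    rng = random.Random(seed)
    names = sorted(systems)
    hybrids = []
    for _ in range(n_hybrids):
        name_a, name_b = rng.sample(names, 2)  # two distinct systems
        hybrids.append(make_hybrid(systems[name_a], systems[name_b], rng))
    return hybrids

# Toy data invented for illustration: three systems, three segments each.
toy_systems = {
    "sys1": [("a1", 70.0), ("a2", 65.0), ("a3", 80.0)],
    "sys2": [("b1", 60.0), ("b2", 75.0), ("b3", 55.0)],
    "sys3": [("c1", 50.0), ("c2", 85.0), ("c3", 90.0)],
}
hybrids = sample_hybrids(toy_systems, n_hybrids=5)
```

The statements report 10K hybrids per language pair; in this sketch that would correspond to n_hybrids=10000. The seed parameter is added here only to make the sampling reproducible.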

Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance

Ma, Bojar, Graham 2018

Proceedings of the Third Conference on Machine Translation: Shared Task Papers

Self Cite


This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT18 News Translation Task with automatic metrics. We collected scores of 10 metrics from 8 research groups. In addition to that, we computed scores of 8 standard metrics (BLEU, SentBLEU, chrF, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with WMT18 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in judging the quality of a particular sentence relative to alternate outputs). This year, we employ a single kind of manual evaluation: direct assessment (DA).
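The system-level correlation described in the abstract above can be sketched as follows. The abstract does not name the correlation coefficient, so Pearson's r is an assumption here, and the function name and scores are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Invented example: one metric score and one human score per MT system.
metric_scores = [0.21, 0.25, 0.30, 0.34]
human_scores = [55.0, 60.0, 66.0, 71.0]
r = pearson(metric_scores, human_scores)
```

A value of r close to 1 would mean the metric orders and spaces the systems much as the human assessment does.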


“…We also include significance test results for large hybrid-super-samples of systems (Graham and Liu, 2016). 10K hybrid systems were created per language pair, with corresponding DA human assessment scores by sampling pairs of systems from WMT17 translation task and NMT training task, creating hybrid systems by randomly selecting each candidate translation from one of the two selected systems.…”

Section: System-level Results For News (mentioning)

confidence: 99%

Results of the WMT17 Metrics Shared Task

Bojar, Graham, Kamran 2017

Proceedings of the Second Conference on Machine Translation

Self Cite


This paper presents the results of the WMT17 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT17 news translation task and Neural MT training task. We collected scores of 14 metrics from 8 research groups. In addition to that, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with WMT17 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in judging the quality of a particular sentence). This year, we build upon two types of manual judgements: direct assessment (DA) and HUME manual semantic judgements.
