Results of the WMT16 Metrics Shared Task

Bojar, Ondřej; Graham, Yvette; Kamran, Amir; Stanojević, Miloš

doi:10.18653/v1/w16-2302

Cited by 87 publications

(101 citation statements)

References 21 publications

“…Such meta-evaluation commonly takes the form of the degree to which metrics scores correlate with human assessment. In MT, the stronger the correlation of a metric with human assessment, the better the metric is considered to be [12].…”

Section: Human Evaluation In Machine Translation (mentioning)

confidence: 99%

Evaluation of automatic video captioning using direct assessment

2018

Self Cite

We present Direct Assessment, a method for manually assessing the quality of automatically-generated captions for video. Evaluating the accuracy of video captions is particularly difficult because for any given video clip there is no definitive ground truth or correct answer against which to measure. Metrics for comparing automatic video captions against a manual caption such as BLEU and METEOR, drawn from techniques used in evaluating machine translation, were used in the TRECVid video captioning task in 2016 but these are shown to have weaknesses. The work presented here brings human assessment into the evaluation by crowd sourcing how well a caption describes a video. We automatically degrade the quality of some sample captions which are assessed manually and from this we are able to rate the quality of the human assessors, a factor we take into account in the evaluation. Using data from the TRECVid video-to-text task in 2016, we show how our direct assessment method is replicable and robust and scales to where there are many caption-generation techniques to be evaluated including the TRECVid video-to-text task in 2017.
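The meta-evaluation described in the citation statement above, where a metric is judged by how strongly its scores correlate with human assessment, can be sketched roughly as follows. This is a minimal illustration, not the shared task's actual procedure: the function name `pearson` and all score values are hypothetical placeholders, not results from any paper listed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical system-level scores: one entry per MT system (made-up numbers).
metric_scores = [0.21, 0.25, 0.30, 0.34]
human_scores = [-0.40, -0.10, 0.20, 0.50]

# Taking the absolute value lets error-style metrics (lower is better)
# be compared with score-style metrics (higher is better).
abs_r = abs(pearson(metric_scores, human_scores))
print(f"absolute Pearson correlation: {abs_r:.3f}")
```

Under this reading, a metric whose absolute correlation is closer to 1 would be considered the better metric.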

“…Below we describe the obtained results for newstest2016 (Bojar et al, 2016b) and compare them with results of metrics tasks. At the time of publication of the article, results of newstest2019 were not yet available.…”

Section: Results (mentioning)

confidence: 99%

Quality Estimation and Translation Metrics via Pre-trained Word and Sentence Embeddings

Yankovskaya, Tättar, Fishel

2019

Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We propose the use of pre-trained embeddings as features of a regression model for sentence-level quality estimation of machine translation. In our work we combine freely available BERT and LASER multilingual embeddings to train a neural-based regression model. In the second proposed method we use as input features not only pre-trained embeddings, but also the log probability of any machine translation (MT) system. Both methods are applied to several language pairs and are evaluated both as a classical quality estimation system (predicting the HTER score) and as an MT metric (predicting human judgements of translation quality).

“…1 CHRF3 (Popović, 2015); 2 SIMPBLEU-RECALL (Song et al, 2013); 3 NIST (Doddington, 2002); 4 BEER (Stanojević and Sima'an, 2014). Table 3: The preliminary results of the WMT16 metrics task: Absolute Pearson correlation of out-of-English and to-English system-level metric scores. All results are cited from (Bojar et al, 2016).…”

Section: Comparison With Other Metrics (mentioning)

confidence: 99%

CharacTer: Translation Edit Rate on Character Level

Wang, Peter, Rosendahl et al.

2016

Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Recently, the capability of character-level evaluation measures for machine translation output has been confirmed by several metrics. This work proposes translation edit rate on character level (CharacTER), which calculates the character level edit distance while performing the shift edit on word level. The novel metric shows high system-level correlation with human rankings, especially for morphologically rich languages. It outperforms the strong CHRF by up to 7% correlation on different metric tasks. In addition, we apply the hypothesis sentence length for normalizing the edit distance in CharacTER, which also provides significant improvements compared to using the reference sentence length.
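The abstract's two central ideas, character-level edit distance and normalization by hypothesis length, can be sketched as below. This is a deliberately simplified illustration and not the CharacTER metric itself: it omits the word-level shift edit the abstract describes, and the names `char_edit_distance` and `simplified_char_ter` are hypothetical.

```python
def char_edit_distance(hyp, ref):
    """Plain Levenshtein distance over characters (no shift edits)."""
    # prev[j] holds the distance between the processed prefix of hyp
    # and the first j characters of ref.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            substitution = prev[j - 1] + (0 if h == r else 1)
            deletion = prev[j] + 1      # drop a hypothesis character
            insertion = curr[j - 1] + 1  # add a reference character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def simplified_char_ter(hyp, ref):
    """Edit distance normalized by hypothesis length (assumption: no shifts)."""
    if not hyp:
        raise ValueError("hypothesis must be non-empty")
    return char_edit_distance(hyp, ref) / len(hyp)


# Lower is better; 0.0 means the hypothesis matches the reference exactly.
print(simplified_char_ter("the cat sat", "the cat sits"))
```

Dividing by the hypothesis length rather than the reference length follows the normalization choice stated in the abstract; everything else here is a stand-in for the full algorithm.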
