Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1) 2019
DOI: 10.18653/v1/w19-5351
Linguistic Evaluation of German-English Machine Translation Using a Test Suite

Abstract: We present the results of applying a grammatical test suite for German→English MT to the systems submitted at WMT19, with a detailed analysis of 107 phenomena organized into 14 categories. The systems still translate one out of four test items incorrectly on average. Performance is low for idioms, modals, pseudo-clefts, multi-word expressions and verb valency. Compared to last year, there has been an improvement on function words, non-verbal agreement and punctuation. More detailed conclusions…
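The abstract reports pass rates aggregated over individual test items into phenomenon and category scores. A minimal sketch of that kind of aggregation, assuming a simple (category, phenomenon, passed) record layout; the category and phenomenon names below are illustrative, not taken from the paper:

```python
# Hypothetical sketch: aggregating test-suite results into per-category
# pass rates. Record layout and names are assumptions for illustration.
from collections import defaultdict

def category_accuracy(results):
    """results: iterable of (category, phenomenon, passed) triples;
    returns the fraction of passed items per category."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, _phenomenon, ok in results:
        total[category] += 1
        passed[category] += int(ok)
    return {c: passed[c] / total[c] for c in total}

results = [
    ("MWE", "idiom", False),
    ("MWE", "collocation", True),
    ("Verb tense/aspect/mood", "modal", False),
    ("Verb tense/aspect/mood", "conditional", True),
]
print(category_accuracy(results))  # {'MWE': 0.5, 'Verb tense/aspect/mood': 0.5}
```

An overall score such as "one out of four items wrong" then falls out of the same counts summed over all categories.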

Cited by 16 publications (17 citation statements); references 10 publications.
“…The third rule that we conform to is to 1) create two contrastive source sentences for each lexical or syntactic ambiguity point, where each source sentence corresponds to one reasonable interpretation of the ambiguity point, and 2) provide two contrastive translations for each created source sentence. This is similar to other linguistic evaluation by contrastive examples in the MT literature (Avramidis et al., 2019; Bawden et al., 2018; Müller et al., 2018; Sennrich, 2017). These two contrastive translations have similar wordings: one is correct and the other is incorrect in that it translates the ambiguous part with the translation corresponding to the contrastive source sentence.…”
Section: Test Suite Design (supporting)
confidence: 83%
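The contrastive design described in this excerpt can be sketched as a small data structure plus a preference check: a test item passes if the system under evaluation prefers the correct translation over the contrastive one. All names and the scoring interface below are assumptions for illustration, not the cited authors' code:

```python
# Hypothetical sketch of a contrastive test item and its pass criterion.
# The ContrastiveItem layout and score() interface are assumptions.
from dataclasses import dataclass

@dataclass
class ContrastiveItem:
    source: str        # one reading of the ambiguity point
    correct: str       # translation matching this reading
    contrastive: str   # translation matching the *other* reading

def item_passes(model_score, item):
    """model_score(src, tgt) -> float, higher is better; the item
    passes if the model scores the correct translation higher."""
    return model_score(item.source, item.correct) > \
           model_score(item.source, item.contrastive)

# Example with a dummy scorer (a real one would be an MT model's
# log-probability of the target given the source).
item = ContrastiveItem(
    source="Er sah den Mann mit dem Fernglas.",
    correct="He saw the man with the binoculars.",
    contrastive="He saw the man who had the binoculars.",
)
dummy = lambda src, tgt: -len(tgt)  # toy: prefers shorter output
print(item_passes(dummy, item))
```

Each ambiguity point yields two such items (one per reading), so a system that always picks one reading fails half of them.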
“…German-to-English (Avramidis et al., 2019): The test suite by DFKI covers 107 grammatical phenomena organized into 14 categories. The test suite is very closely related to the one used last year (Macketanz et al., 2018), which allows an evaluation over time.…”
Section: Linguistic Evaluation Of (mentioning)
confidence: 99%
“…In WMT 2019, English-German phenomena were tested with a new corpus, using both human and automatic evaluation. It is not possible, however, to use this evaluation outside the competition (Avramidis et al., 2019).…”
(mentioning)
confidence: 99%