Proceedings of the Tenth Workshop on Statistical Machine Translation 2015
DOI: 10.18653/v1/w15-3059

How do Humans Evaluate Machine Translation

Abstract: In this paper, we take a closer look at the MT evaluation process from a glass-box perspective using eye-tracking. We analyze two aspects of the evaluation task, the background of evaluators (monolingual or bilingual) and the sources of information available, and we evaluate them using time and consistency as criteria. Our findings show that monolinguals are slower but more consistent than bilinguals, especially when only target language information is available. When exposed to various sources of information,…

Cited by 12 publications (6 citation statements)
References 12 publications
“…In the latter case, the translation is most often presented to bilingual evaluators, who know both the source and the target language, so that they assign a quality score to a given segment, e.g. from 1 = poor to 5 = excellent (see Guzmán et al. 2015). The criteria typically used are: adequacy, i.e. preservation of meaning; fluency, i.e. grammaticality; overall quality (based on a combination of both criteria); and the expected cognitive effort of post-editing (Popović 2018).…”
Section: Quality Assessment (unclassified)
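
To make the scoring scheme described in that statement concrete, the following is a minimal sketch (my own illustration, not code from Guzmán et al. 2015 or Popović 2018) of recording segment-level 1–5 judgements for adequacy and fluency and averaging them into an overall score, in Python:

# Minimal sketch: segment-level MT quality judgements on a 1-5 scale.
# Criterion names follow the citation statement above (adequacy, fluency,
# overall quality); the data and structure are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean

@dataclass
class SegmentJudgement:
    segment_id: int
    adequacy: int   # 1 = poor ... 5 = excellent (preservation of meaning)
    fluency: int    # 1 = poor ... 5 = excellent (grammaticality)

judgements = [
    SegmentJudgement(1, adequacy=4, fluency=5),
    SegmentJudgement(2, adequacy=2, fluency=3),
    SegmentJudgement(3, adequacy=5, fluency=4),
]

# Overall quality is modelled here simply as the mean of both criteria.
overall = [mean((j.adequacy, j.fluency)) for j in judgements]
print(f"Mean adequacy: {mean(j.adequacy for j in judgements):.2f}")
print(f"Mean fluency:  {mean(j.fluency for j in judgements):.2f}")
print(f"Mean overall:  {mean(overall):.2f}")
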
“…Human evaluation of text quality: Most previous studies on human evaluation concentrate on constrained generation domains, such as machine translation (Guzmán et al., 2015; Graham et al., 2017; Toral et al., 2018; Castilho, 2021) or summarization (Gillick and Liu, 2010; Iskender et al., 2020). Other studies evaluate very short, often one-sentence-long, outputs (Grundkiewicz et al., 2015; Mori et al., 2019; Khashabi et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%
“…Research in manual evaluation has focused on overcoming annotator bias, i.e. the preferences and expectations of individual annotators with respect to translation quality that lead to low levels of inter-annotator agreement (Cohn and Specia, 2013; Denkowski and Lavie, 2010; Graham et al., 2013; Guzmán et al., 2015). The problem of reference bias, however, has not been examined in previous work.…”
Section: Related Work (mentioning)
confidence: 99%
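
The low inter-annotator agreement discussed in that statement is usually quantified with a chance-corrected statistic such as Cohen's kappa. Below is a minimal sketch, assuming two annotators have scored the same segments on a 1–5 scale and using scikit-learn's cohen_kappa_score; the scores are illustrative, not data from the cited studies:

# Minimal sketch: chance-corrected agreement between two MT evaluators.
from sklearn.metrics import cohen_kappa_score

annotator_a = [5, 4, 4, 2, 3, 5, 1, 4]
annotator_b = [5, 3, 4, 2, 2, 5, 2, 4]

# Unweighted kappa treats every disagreement equally; quadratic weights
# penalise large disagreements (e.g. 1 vs 5) more than adjacent scores.
kappa = cohen_kappa_score(annotator_a, annotator_b)
weighted = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")

print(f"Cohen's kappa:            {kappa:.3f}")
print(f"Quadratic-weighted kappa: {weighted:.3f}")
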