2022
DOI: 10.1075/intp.00076.lu

Automatic assessment of spoken-language interpreting based on machine-translation evaluation metrics

Abstract: Automated metrics for machine translation (MT) such as BLEU are customarily used because they are quick to compute and sufficiently valid to be useful in MT assessment. Whereas the instantaneity and reliability of such metrics are made possible by automatic computation based on predetermined algorithms, their validity is primarily dependent on a strong correlation with human assessments. Despite the popularity of such metrics in MT, little research has been conducted to explore their usefulness in the automati…

Cited by 5 publications (8 citation statements)
References 26 publications
“…For RQ1 and RQ2, we find that ChatGPT excels under human evaluation and semantic-aware automatic evaluation (see Figure 1 and Table 2). For RQ3, we find that the two methods may exhibit divergences when approaching machine-translated texts, contrary to Lu and Han's (2023) findings that automated metrics can show moderate to strong correlations with human-assigned scores in assessing interpreting outputs, possibly due to the inherent differences between interpreting and translation. …ing (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022a,b; Wang et al., 2022), and several studies have explored the influence of prompting strategies on the translation performance of LLMs (Jiao et al., 2023; Hendy et al., 2023; Peng et al., 2023; Chen et al., 2023; He et al., 2023).…”
Section: Introduction (contrasting)
confidence: 90%
“…Analytic rubric scoring is another method widely adopted in TQA research. It is founded on the assumption that the overall concept of quality can be broken down into individual components, and typically comprises several sub-scales addressing separate dimensions of translation (Lu and Han, 2023). To complement the error typology-based evaluation, we propose six analytic rubrics to capture translation quality from different perspectives, encompassing dimensions of (1) coherence, (2) adherence to norms, (3) style, tone, and register appropriateness, (4) cultural sensitivity, (5) clarity, and (6) practicality.…”
Section: Human Evaluation Based On Analytic Rubric Scoring (mentioning)
confidence: 99%
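The quoted passage describes analytic rubric scoring as decomposing overall quality into separate sub-scales. Below is a minimal sketch of how such per-dimension ratings might be aggregated into one score; the six dimension names follow the quoted passage, but the 1–5 scale, equal weighting, and function names are illustrative assumptions, not the cited paper's procedure.

```python
# Minimal sketch (assumptions: 1-5 scale, equal weights) of aggregating
# analytic rubric ratings into an overall quality score. Dimension names
# follow the quoted passage; everything else is illustrative.
from statistics import mean

DIMENSIONS = [
    "coherence",
    "adherence_to_norms",
    "style_tone_register",
    "cultural_sensitivity",
    "clarity",
    "practicality",
]

def overall_score(ratings: dict) -> float:
    """Average the per-dimension ratings (assumed equal weights)."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    return mean(ratings[d] for d in DIMENSIONS)

# Example: one rater's scores for a single translated segment.
print(overall_score({
    "coherence": 4, "adherence_to_norms": 5, "style_tone_register": 3,
    "cultural_sensitivity": 4, "clarity": 4, "practicality": 5,
}))  # -> 4.1666...
```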
“…In recent empirical studies (Chung, 2020; Han and Lu, 2021; Lu and Han, 2022), a few researchers have investigated the utility of several metrics (i.e., BLEU, METEOR, NIST, and TER) in assessing translations or interpretations and correlated the metric scores with the human-assigned scores. Chung (2020) computes two metrics (i.e., BLEU and METEOR) to assess 120 German-to-Korean translations produced by ten student translators on 12 German texts concerning a variety of topics.…”
Section: Computational Features For Fidelity Assessment (mentioning)
confidence: 99%
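The studies quoted above share a common validation design: compute automated metric scores for each output and correlate them with human-assigned scores. The sketch below illustrates that design with segment-level BLEU (via sacreBLEU) and a Spearman correlation; the toy sentences, human scores, and metric settings are assumptions and do not reproduce the cited studies' data or configurations.

```python
# Minimal sketch (toy data, assumed settings) of the metric-validation design
# described above: score each segment with an automated MT metric, then
# correlate the metric scores with human-assigned scores.
import sacrebleu
from scipy.stats import spearmanr

hypotheses = [
    "the interpreter rendered the speech faithfully and completely",
    "the speech was render by the interpreter with some omission",
    "delegates approved the budget after a short debate",
    "the vote was postponed to next week",
]
references = [
    "the interpreter rendered the speech faithfully and completely",
    "the interpreter rendered the speech faithfully and completely",
    "the delegates approved the budget after a brief debate",
    "the committee postponed the vote until the following week",
]
human_scores = [5.0, 2.5, 4.0, 3.5]  # illustrative rater judgments (1-5 scale)

# Segment-level BLEU for each hypothesis against its reference.
bleu_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]

rho, p_value = spearmanr(bleu_scores, human_scores)
print("BLEU per segment:", [round(s, 1) for s in bleu_scores])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

The same loop extends to METEOR, NIST, or TER by swapping the scoring call; the correlation step stays unchanged.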
“…Recently, Lu and Han (2022) in another study evaluate 56 bidirectional consecutive English-Chinese interpretations produced by 28 student interpreters of varying abilities by the same metrics and one more pre-trained model, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). They correlate the automated metric scores with the scores assigned by different types of raters using different scoring methods (i.e., multiple assessment scenarios).…”
Section: Computational Features For Fidelity Assessment (mentioning)
confidence: 99%
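The BERT-based scoring mentioned above replaces n-gram overlap with contextual-embedding similarity. A minimal sketch using the bert-score package is shown below, followed by a Pearson correlation against human scores; the candidate and reference sentences, human scores, and choice of English model are assumptions for illustration, not the cited study's exact configuration.

```python
# Minimal sketch (assumed tooling and toy data): BERTScore, an
# embedding-based metric, correlated against human fidelity scores.
from bert_score import score   # pip install bert-score; downloads a model on first run
from scipy.stats import pearsonr

candidates = [
    "the delegates approved the budget after a short debate",
    "budget was approve by delegate after debating shortly",
    "the committee postponed the vote until next week",
]
references = [
    "the delegates approved the budget after a brief debate",
    "the delegates approved the budget after a brief debate",
    "the committee deferred the vote to the following week",
]
human_scores = [5.0, 3.0, 4.5]  # illustrative rater judgments

# F1 is the usual headline BERTScore value; P and R are precision and recall.
P, R, F1 = score(candidates, references, lang="en", verbose=False)

r, p_value = pearsonr(F1.tolist(), human_scores)
print("BERTScore F1:", [round(x, 3) for x in F1.tolist()])
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```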