Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics 2023
DOI: 10.18653/v1/2023.eacl-main.95
Poor Man’s Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference

Vilém Zouhar,
Shehzaad Dhuliawala,
Wangchunshu Zhou
et al.

Abstract: Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements, yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME), where one predicts the automated metric scores also without the reference.…

Cited by 2 publications (2 citation statements). References 34 publications.
“…It is especially active since the recognition that decades-old metrics such as BLEU and ROUGE are inadequate for evaluation (Peyrard, 2019). The focus in recent years is on developing high-quality LLM-based metrics that are (among others) explainable (Kaster et al., 2021; Leiter et al., 2022a, 2022b; Sai et al., 2021), efficient (Kamal Eddine et al., 2022; Grünwald et al., 2022; Zouhar et al., 2023), robust (Rony et al., 2022), and reproducible (Chen et al., 2022; Grusky, 2023). The focus of Eval4NLP's Shared Task is on explainable high-quality metrics induced from prompting the most recent classes of LLMs, including variants of LLaMA.…”
Section: Related Work
confidence: 99%
“…It is especially active since the recognition that decades-old metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are inadequate for evaluation (Mathur et al., 2020; Peyrard, 2019; Freitag et al., 2022). The focus in recent years is on developing high-quality LLM-based metrics (Zhang et al., 2020; Zhao et al., 2019) that are (among others) explainable (Kaster et al., 2021; Leiter et al., 2022a, 2023a, 2022b; Sai et al., 2021), efficient (Kamal Eddine et al., 2022; Grünwald et al., 2022; Zouhar et al., 2023; Belouadi and Eger, 2023), robust (Chen and Eger, 2023; Rony et al., 2022), and reproducible (Chen et al., 2022; Grusky, 2023). The focus of Eval4NLP's Shared Task is on explainable high-quality metrics induced from prompting the most recent classes of LLMs, including variants of LLaMA (Touvron et al., 2023).…”
Section: Related Work
confidence: 99%