Poor Man’s Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference

Zouhar, Vilém; Dhuliawala, Shehzaad; Zhou, Wangchunshu; Daheim, Nico; Kocmi, Tom; Jiang, Yuchen Eleanor; Sachan, Mrinmaya

doi:10.18653/v1/2023.eacl-main.95

Cited by 2 publications

(2 citation statements)

References 34 publications


“…It is especially active since the recognition that decades old metrics such as BLEU and ROUGE are inadequate for evaluation (Peyrard, 2019). The focus in recent years is on developing high-quality LLM based metrics that are (among others) explainable (Kaster et al., 2021; Leiter et al., 2022a, 2022b; Sai et al., 2021), efficient (Kamal Eddine et al., 2022; Grünwald et al., 2022; Zouhar et al., 2023), robust (Rony et al., 2022), and reproducible (Chen et al., 2022; Grusky, 2023). The focus of Eval4NLP's Shared Task is on explainable high-quality metrics induced from prompting the most recent classes of LLMs including variants of LLaMA.…”

Section: Related Work (mentioning)

confidence: 99%

Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems

2023


Continuous progress in Named Entity Recognition (NER) allows the identification of complex entities in multiple domains. Traditionally used metrics such as precision, recall, and F1-score reflect the classification quality of the underlying NER model only to a limited extent. Existing metrics do not distinguish between the non-recognition of an entity and the misclassification of an entity. Additionally, the handling of redundant entities remains unaddressed. We propose WRF, a Weighted Rouge F1 metric for Entity Recognition, to close these gaps in currently available metrics. We successfully employ the WRF metric for automotive entity recognition, followed by a comprehensive qualitative and quantitative analysis of the obtained results.


“…It is especially active since the recognition that decades old metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are inadequate for evaluation (Mathur et al., 2020; Peyrard, 2019; Freitag et al., 2022). The focus in recent years is on developing high-quality LLM based metrics (Zhang et al., 2020; Zhao et al., 2019) that are (among others) explainable (Kaster et al., 2021; Leiter et al., 2022a, 2023a, 2022b; Sai et al., 2021), efficient (Kamal Eddine et al., 2022; Grünwald et al., 2022; Zouhar et al., 2023; Belouadi and Eger, 2023), robust (Chen and Eger, 2023; Rony et al., 2022), and reproducible (Chen et al., 2022; Grusky, 2023). The focus of Eval4NLP's Shared Task is on explainable high-quality metrics induced from prompting the most recent classes of LLMs including variants of LLaMA (Touvron et al., 2023).…”

Section: Related Work (mentioning)

confidence: 99%

Team NLLG submission for Eval4NLP 2023 Shared Task: Retrieval-Augmented In-Context Learning for NLG Evaluation

Larionov, Viskov, Kokush et al. 2023

Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems


In this paper, we introduce a novel approach for evaluating natural language generation (NLG) using retrieval-augmented in-context learning. Our method empowers practitioners to leverage large language models (LLMs) for diverse NLG evaluation tasks without the need for finetuning. We put our approach to the test in the context of the Eval4NLP 2023 Shared Task, specifically in translation evaluation and summarization evaluation subtasks. The results indicate that retrieval-augmented in-context learning holds great promise for the development of LLM-based NLG evaluation metrics. Future research directions involve investigating the performance of various publicly available LLM models and identifying the specific LLM attributes that contribute to enhancing metric quality.
