2021
DOI: 10.48550/arxiv.2110.09147
Preprint

BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation

Thomas Scialom,
Felix Hill

Abstract: Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than classifying between responses. This makes research on evaluation metrics for generated language - functions that score system output given the context and/or human reference responses - of critical importance. However, different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others. There is currently no simple, unified way to compare, analyse or evaluate…
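To make the abstract's framing concrete, here is a minimal sketch of what such a metric is (a function scoring an output against a reference and/or context) and of the standard meta-evaluation protocol: correlating metric scores with human judgements. This is not the paper's code; the names (toy_overlap_metric, meta_evaluate) and the toy data are illustrative assumptions.

```python
# Minimal sketch (illustrative, not BEAMetrics code) of an evaluation metric
# and correlation-based meta-evaluation against human judgements.
from typing import Optional, Sequence
from scipy.stats import pearsonr, spearmanr

def toy_overlap_metric(output: str, reference: Optional[str] = None,
                       context: Optional[str] = None) -> float:
    """Score a generated output given a reference and/or context.

    This toy metric is unigram F1 against the reference; real metrics
    (BLEU, ROUGE, BERTScore, ...) share the same signature idea.
    """
    if reference is None:
        return 0.0
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    if not out_tokens or not ref_tokens:
        return 0.0
    overlap = len(out_tokens & ref_tokens)
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def meta_evaluate(outputs: Sequence[str], references: Sequence[str],
                  human_scores: Sequence[float]) -> dict:
    """Meta-evaluate the metric: correlate its scores with human ratings."""
    metric_scores = [toy_overlap_metric(o, r) for o, r in zip(outputs, references)]
    return {
        "pearson": pearsonr(metric_scores, human_scores)[0],
        "spearman": spearmanr(metric_scores, human_scores).correlation,
    }

if __name__ == "__main__":
    outputs = ["the cat sat on the mat", "a dog ran", "hello world"]
    references = ["the cat is on the mat", "a dog was running", "goodbye world"]
    human = [0.9, 0.6, 0.2]  # hypothetical human quality ratings
    print(meta_evaluate(outputs, references, human))
```

A benchmark of this kind repeats the correlation step across many tasks and judgement types, which is how different metrics can be shown to reflect human intuitions better on some tasks than others.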

Cited by 2 publications (2 citation statements). References 65 publications.
“…Adding to the related work mentioned throughout the paper, works on unified evaluation of text generation across tasks include GEM, where the focus is on evaluating system outputs and not the factual consistency evaluation methods as in TRUE. BEAMetrics (Scialom and Hill, 2021) proposes meta-evaluation protocols across tasks, but does not focus on factual consistency. When discussing consistency ("correctness") they measure correlations, which are not sufficient as mentioned in Section 2.3.…”
Section: Related Work (citation type: mentioning, confidence: 99%)
“…generation across tasks include GEM, where the focus is on evaluating system outputs and not the factual consistency evaluation methods as in TRUE. BEAMetrics (Scialom and Hill, 2021) proposes meta-evaluation protocols across tasks, but does not focus on factual consistency. When discussing consistency ("correctness") they measure correlations, which are not sufficient as mentioned in Section 2.3. […] present an adversarial meta-evaluation for factual consistency evaluators, focused on summarization.…”
Section: Related Work (citation type: mentioning, confidence: 99%)
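The snippet's point that correlations "are not sufficient" for consistency can be made concrete: when consistency labels are binary, TRUE argues for threshold-free measures such as ROC AUC over binarized labels rather than correlation alone. The sketch below is my own hedged illustration, not code from either paper; the labels and scores are invented.

```python
# Hedged illustration (not from BEAMetrics or TRUE): with binary consistency
# labels, correlation and ROC AUC answer different questions about a metric.
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

labels = [1, 1, 1, 0, 0, 0]               # hypothetical: 1 = consistent
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]   # hypothetical metric scores

print("pearson r:", pearsonr(scores, labels)[0])    # linear association
print("roc auc:", roc_auc_score(labels, scores))    # ranking/detection quality
```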