2019
DOI: 10.48550/arxiv.1904.09675
Preprint
BERTScore: Evaluating Text Generation with BERT

Abstract: We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than exist…
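The greedy-matching idea in the abstract can be sketched in a few lines. This is an illustrative toy, not the reference implementation: the embedding matrices below are stand-ins for real BERT outputs, and rows are assumed pre-normalized so the inner product is cosine similarity.

```python
import numpy as np

def bertscore_f1(ref_emb, cand_emb):
    """Greedy-match F1 over token embeddings.

    ref_emb:  (num_ref_tokens, dim) array, rows unit-normalized
    cand_emb: (num_cand_tokens, dim) array, rows unit-normalized
    """
    sim = cand_emb @ ref_emb.T          # pairwise cosine similarities
    precision = sim.max(axis=1).mean()  # each candidate token greedily matches its best reference token
    recall = sim.max(axis=0).mean()     # each reference token greedily matches its best candidate token
    return 2 * precision * recall / (precision + recall)

# Toy example: identical "embeddings" give a perfect score.
emb = np.eye(3)
print(bertscore_f1(emb, emb))  # → 1.0
```

Dropping one candidate token lowers recall (an unmatched reference token) while precision stays at 1, so the F1 falls below 1 — the same trade-off the full metric exhibits.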

Cited by 326 publications (382 citation statements); references 73 publications.
“…The first reward promotes the coverage of radiology domain entities with corresponding reference reports, and the second reward promotes the consistency of the generated reports with their descriptions in the reference reports. Further, they combine these reward functions with the semantic equivalence metric of BERTScore [345], which yields generated reports that perform better on clinical metrics.…”
Section: Reinforcement Learning Based Approaches
Mentioning confidence: 99%
“…There are many types of trained models that can be used for stylized caption evaluation. These include text classifiers trained to identify the style of a sentence, or a model like BertScore [24], which uses a trained language model to compare generated text against a reference sentence. One of the main limitations of using a trained model as an evaluation metric is that it may require retraining to use on a different dataset.…”
Section: Related Work
Mentioning confidence: 99%
“…In a large and comprehensive analysis of metrics in MT, however, [29] recommends deprecating BLEU as the MT evaluation standard. They suggest using more recent metrics such as BERTScore [87] or COMET [64], that are shown to reflect human judgement significantly better.…”
Section: Evaluating Metrics
Mentioning confidence: 99%
“…BERTScore [87] leverages the contextualised representation of BERT to compute the similarity between the tokens.…”
Section: Metrics
Mentioning confidence: 99%
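The token-level similarity computation these citation statements refer to is defined in the BERTScore paper as greedy matching over pairwise cosine similarities, with embeddings pre-normalized so the inner product equals cosine similarity. For a reference sentence with tokens $x_i$ and a candidate with tokens $\hat{x}_j$:

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j,
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j,
\qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```

Recall matches each reference token to its most similar candidate token, precision does the reverse, and the reported score is their harmonic mean.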