Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, 2008
DOI: 10.3115/1557690.1557747

Correlation between ROUGE and human evaluation of extractive meeting summaries

Abstract: Automatic summarization evaluation is critical to the development of summarization systems. While ROUGE has been shown to correlate well with human evaluation of content match in text summarization, the multiparty meeting domain has many characteristics that may pose problems for ROUGE. In this paper, we carefully examine how well ROUGE scores correlate with human evaluation for extractive meeting summarization. Our experiments show that generally the correlation is rather low, but a signi…
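As a rough illustration of the analysis the abstract describes, the sketch below computes a rank correlation between per-summary ROUGE scores and human ratings. This is a minimal sketch with hypothetical placeholder scores; the paper's actual correlation procedure and data are not reproduced on this page.

```python
# Minimal sketch: correlating automatic ROUGE scores with human ratings
# for a set of system summaries. The score values are hypothetical
# placeholders, not data from the paper.
from scipy.stats import spearmanr

rouge_scores = [0.41, 0.35, 0.52, 0.29, 0.47]   # e.g. ROUGE-1 recall per summary
human_scores = [3.5, 2.0, 4.0, 3.0, 3.5]        # e.g. human ratings per summary

rho, p_value = spearmanr(rouge_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A low rho under this kind of analysis would mirror the paper's finding that ROUGE and human judgments often disagree in the meeting domain.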

Cited by 340 publications (407 citation statements)
References 13 publications
“…Although the generated summaries were not good from a human's point of view, they obtained good results on some ROUGE metrics (for example, a recall score of 41% for ROUGE-1, which is acceptable relative to the state of the art in this research field). In addition, the correlation between ROUGE and model summaries was shown to be lower than claimed, especially for some summarization types, such as speech summarization (Liu and Liu, 2008b). Despite the need to have model summaries beforehand when using ROUGE, various researchers have shown that there is a significant correlation between ROUGE scores and approaches based on human comparison of semantic content units (indeed, this was necessary for ROUGE to win acceptance).…”
Section: Summary Content
confidence: 97%
“…On the one hand, most content-oriented evaluation tools are based on content overlap, which introduces a bias toward lexical similarity and may lead to unfair penalties when abstractive summaries are evaluated. However, it is interesting to note that, in spite of this a priori disadvantageous situation, abstractive human summaries usually obtain significantly higher ROUGE scores (Liu and Liu, 2008a).…”
Section: Discussion
confidence: 99%
“…These TT sets correspond to the top x terms ranked by the probability of a word given the topic (p(w|k)) from the topic model. We evaluated these summarisation approaches with the ROUGE-1 method (Lin, 2004), a widely used summarisation evaluation metric that correlates well with human evaluation (Liu and Liu, 2008). This method measures the overlap of words between the generated summary and a reference, in our case the GS generated from the NW dataset.…”
Section: Results
confidence: 99%
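The quoted passage describes ROUGE-1 as word overlap between a generated summary and a reference. The sketch below implements that unigram-overlap computation in a simplified form; it uses plain whitespace tokenisation and is not the official ROUGE toolkit of Lin (2004), which additionally supports stemming, stopword removal, and multiple references.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """ROUGE-1 via clipped unigram overlap (whitespace tokenisation).

    A simplified sketch of Lin (2004); the official toolkit adds
    stemming, stopword handling, and multi-reference support.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each candidate unigram can match at most its count in the reference.
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: compare a generated summary against a reference.
print(rouge_1("the meeting discussed budget cuts",
              "participants discussed the budget cuts in the meeting"))
```

In this form, recall divides the overlap by the reference length, which is why a recall-oriented ROUGE-1 score such as the 41% quoted above can look acceptable even when human judges rate the summary poorly.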
“…In particular, ROUGE-1, which works at the unigram level, was shown to significantly correlate with human evaluations. While it has been suggested that the correlation may be weaker in the meeting domain (Liu and Liu, 2008), we stuck to ROUGE because of the lack of a clear substitute, and for consistency with the literature, as a very large majority of studies previously published in the domain use ROUGE.…”
[footnote 5: http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html]
Section: Discussion
confidence: 99%