Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu 2018
DOI: 10.18653/v1/n18-1152

Estimating Summary Quality with Pairwise Preferences

Abstract: Automatic evaluation systems in the field of automatic summarization have been relying on the availability of gold standard summaries for over ten years. Gold standard summaries are expensive to obtain and often require the availability of domain experts to achieve high quality. In this paper, we propose an alternative evaluation approach based on pairwise preferences of sentences. In comparison to gold standard summaries, they are simpler and cheaper to obtain. In our experiments, we show that humans are able…

Cited by 23 publications (16 citation statements)
References 29 publications
“…As for the intrinsic evaluation function U*, recent work has suggested that human preferences over summaries have high correlations to ROUGE scores (Zopf, 2018). Therefore, we define: (2017).…”
Section: APRIL: Decomposing SPPI into Active Preference Learning and RL
Citation type: mentioning
confidence: 99%
“…The use of preference-based feedback in NLP attracts increasing research interest. Zopf (2018) Fig. 1: SPPI (a) directly uses the collected preferences to "teach" its summary-generator, while APRIL (b) learns a reward function as the proxy of the user/oracle, and uses the learnt reward to "teach" the RL-based summariser.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…First, research has shown that ROUGE is inconsistent with human evaluation for summary quality (Liu and Liu, 2008; Zopf, 2018; Kryscinski et al., 2019; Maynez et al., 2020). We evaluate ROUGE using PolyTope from the perspective of both instance-level and system-level performances.…”
Section: Analysis of Evaluation Methods
Citation type: mentioning
confidence: 99%
“…In fact, while yielding rich conclusions, the above analytical work has also exposed deficiencies of automatic toolkits. The quality of automatic evaluation is often criticized by the research community (Novikova et al., 2017; Zopf, 2018) for its insufficiency in neither permeating into the overall quality of generation-based texts (Liu and Liu, 2008) nor correlating with human judgements (Kryscinski et al., 2019).…”
Section: Related Work
Citation type: mentioning
confidence: 99%