Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.417

Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries

Abstract: Current pre-trained models applied for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using …

Cited by 6 publications (3 citation statements)
References 26 publications

Citation statements:
“…Kiritchenko and Mohammad (2017) demonstrated that best-worst scaling (asking evaluators to choose the best and the worst items in a set) is an efficient and reliable method for collecting annotations, and this approach has been used to collect comparative evaluations of generated text (e.g., Liu & Lapata, 2019; Amplayo et al., 2021). Best-worst scaling has also more recently been shown to be a more effective approach than Likert scales for assessing the factual consistency of summaries (Tang et al., 2022). Belz and Kow (2011) further compared continuous and discrete rating scales and found that both lead to similar results, but that raters preferred continuous scales, consistent with prior findings (Svensson, 2000).…”
Section: How Is It Measured? (mentioning; confidence: 99%)
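The best-worst scaling protocol referenced above is typically scored with a simple counting procedure (Kiritchenko & Mohammad, 2017): an item's score is the fraction of times annotators picked it as best minus the fraction of times they picked it as worst. Below is a minimal Python sketch of that scoring step; the tuple format, the `bws_scores` name, and the toy summary IDs are illustrative assumptions, not the interface used in the cited papers.

```python
from collections import defaultdict

def bws_scores(annotations):
    """Counting-based best-worst scaling scores.

    `annotations` is an iterable of (items, best, worst) triples, one per
    annotated n-tuple (this input format is an assumption for illustration).
    Returns a score in [-1, 1] per item:
        (#times chosen best - #times chosen worst) / #times shown.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, chosen_best, chosen_worst in annotations:
        for item in items:
            shown[item] += 1
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    # Higher scores mean the item was more often judged best than worst.
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


if __name__ == "__main__":
    # Toy usage: three annotated 4-tuples of hypothetical summary IDs.
    tuples = [
        (("s1", "s2", "s3", "s4"), "s1", "s4"),
        (("s1", "s2", "s3", "s5"), "s2", "s5"),
        (("s2", "s3", "s4", "s5"), "s2", "s4"),
    ]
    print(bws_scores(tuples))
```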
“…Generative language models have been widely adopted for response generation [14,28]; however, in the realm of open-ended information-seeking dialogues, the assumption that a user's query can be definitively answered by simply summarizing information from the top retrieved passages falls short of reality. System responses are susceptible to various limitations, such as failing to find a response, which may result in hallucinations [10], providing a biased response that only partially answers the question [9], or even presenting content with factual errors [26]. Consequently, relying solely on summarizing relevant information may lead to providing users with biased, incomplete, or, worse, incorrect responses [26].…”
Section: Introduction (mentioning; confidence: 99%)