Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.42
GO FIGURE: A Meta Evaluation of Factuality in Summarization

Abstract: While neural language models can generate text with remarkable fluency and coherence, controlling for factual correctness in generation remains an open research question. This major discrepancy between the surface-level fluency and the content-level correctness of neural generation has motivated a new line of research that seeks automatic metrics for evaluating the factuality of machine text. In this paper, we introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. We prop…


Cited by 48 publications (49 citation statements)
References 31 publications
“…However, the grounding of summary generation that was inherent to most traditional methods is yet to be achieved in neural summarization. The attention mechanism (Bahdanau et al., 2015), especially in pretrained encoder-decoder models (Lewis et al., 2020; Raffel et al., 2019; Zhang et al., 2020), plays a key role in aligning summary content to the input, yet undesired hallucinations are common in generated summaries (Maynez et al., 2020; Kryscinski et al., 2020; Gabriel et al., 2021).…”
mentioning
confidence: 99%
“…Basically, the revision strategy consists of three stages: (1) Phrase Trimming: remove phrases unsupported by source in the exemplar sentence; (2) Decontextualization: resolve co-reference and delete phrases dependent on context; (3) Syntax Modification: make the purified sentences flow smoothly. There are also some works [52,72] leveraging the model to generate data and instruct annotators to label whether these outputs contain hallucinations or not. While this approach is typically used to build diagnostic evaluation datasets, it has the potential to build faithful datasets.…”
Section: Hallucination Mitigation Methods
mentioning
confidence: 99%
“…The results show both QAGS and FEQA have substantially higher correlations with human judgments of faithfulness than the baseline metrics. In addition, Gabriel et al. [52] further analyze FEQA and find that the effectiveness of QA-based metrics depends on the question. They also provide a meta-evaluation framework that includes QA metrics.…”
Section: Hallucination Metrics In Abstractive Summarization
mentioning
confidence: 99%
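The meta-evaluation described in this citation statement boils down to checking how well an automatic factuality metric tracks human judgments, typically via a correlation coefficient. The sketch below illustrates that idea with a hand-rolled Pearson correlation; the metric scores and human ratings are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch of metric meta-evaluation: measure how closely an
# automatic factuality metric's scores track human factuality ratings
# using the Pearson correlation coefficient.
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-summary scores: automatic metric vs. human ratings.
metric_scores = [0.91, 0.42, 0.77, 0.30, 0.88]
human_ratings = [0.95, 0.50, 0.70, 0.25, 0.90]

correlation = pearson(metric_scores, human_ratings)
print(f"Pearson r = {correlation:.3f}")
```

A metric whose scores correlate strongly with human judgments across systems and error types is considered more trustworthy; in practice one would use `scipy.stats.pearsonr` (or Spearman's rank correlation) over a full annotated benchmark rather than five toy points.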
“…Kryscinski et al. (2019) note biases and failure modes of abstractive summarization models, while other work analyzes and collects annotations over the output of recent summarization models across multiple dimensions, including factual consistency (Fabbri et al., 2021; Bhandari et al., 2020; Huang et al., 2020). Lux et al. (2020) propose a typology of errors found in summarization models, while Gabriel et al. (2021) propose a framework for meta-evaluation of faithfulness consistency metrics. Laban et al. (2021) propose to combine recent work in factual consistency evaluation for summarization through a single benchmark.…”
Section: Related Work
mentioning
confidence: 99%