Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

Deutsch, Daniel; Dror, Rotem; Roth, Dan

doi:10.48550/arxiv.2204.10216

Cited by 4 publications

(4 citation statements)

References 5 publications


“…This is mostly due to the inflexibility of computing the F1 score solely based on the similarity between the generated responses and golden references. Such observations echo findings from previous studies (Madotto et al, 2019; Liu et al, 2016; Deutsch et al, 2022). In this work, we still utilize F1 score since it's a standardized metric for PERSONA-CHAT evaluation.…”

Section: Limitations (supporting)

confidence: 83%

PersonaPKT: Building Personalized Dialogue Agents via Parameter-efficient Knowledge Transfer

Han, Guo, Jung et al. 2023

Proceedings of the Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)


Personalized dialogue agents (DAs) powered by large pre-trained language models (PLMs) often rely on explicit persona descriptions to maintain personality consistency. However, such descriptions may not always be available or may pose privacy concerns. To tackle this bottleneck, we introduce PersonaPKT, a lightweight transfer learning approach that can build persona-consistent dialogue models without explicit persona descriptions. By representing each persona as a continuous vector, PersonaPKT learns implicit persona-specific features directly from a small number of dialogue samples produced by the same persona, adding less than 0.1% trainable parameters for each persona on top of the PLM backbone. Empirical results demonstrate that PersonaPKT effectively builds personalized DAs with high storage efficiency, outperforming various baselines in terms of persona consistency while maintaining good response generation quality. In addition, it enhances privacy protection by avoiding explicit persona descriptions. Overall, PersonaPKT is an effective solution for creating personalized DAs that respect user privacy.


“…Deutsch et al (2021) investigated the preciseness of correlations between metrics and human annotations in meta-evaluation benchmarks, and proposed approaches to improve the level of confidence. Finally, Deutsch et al (2022) discussed ways of improving the reliability of system-level correlations in meta-evaluation.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Re-Examining Summarization Evaluation across Multiple Quality Criteria

Ernst, Shapira, Dagan et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023


The common practice for assessing automatic evaluation metrics is to measure the correlation between their induced system rankings and those obtained by reliable human evaluation, where a higher correlation indicates a better metric. Yet, an intricate setting arises when an NLP task is evaluated by multiple Quality Criteria (QCs), like for text summarization where prominent criteria include relevance, consistency, fluency and coherence. In this paper, we challenge the soundness of this methodology when multiple QCs are involved, concretely for the summarization case. First, we show that the allegedly best metrics for certain QCs actually do not perform well, failing to detect even drastic summary corruptions with respect to the considered QC. To explain this, we show that some of the high correlations obtained in the multi-QC setup are spurious. Finally, we propose a procedure that may help detect this effect. Overall, our findings highlight the need for further investigating metric evaluation methodologies for the multiple-QC case.
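The system-level correlation that this abstract refers to can be illustrated with a minimal sketch. It assumes each system's score is the mean of its per-summary scores and that the rankings are compared with Kendall's tau (tau-a, no tie correction). The function names, system names, and toy scores are hypothetical and do not come from any of the cited papers.

```python
from itertools import combinations
from statistics import mean


def system_scores(scores_by_system):
    """Average the per-summary scores into one score per system."""
    return {name: mean(vals) for name, vals in scores_by_system.items()}


def kendall_tau(x, y):
    """Kendall's tau-a for two paired lists (no correction for ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def system_level_correlation(metric_scores, human_scores):
    """Correlate the system ranking induced by a metric with the human one."""
    systems = sorted(metric_scores)
    m = system_scores(metric_scores)
    h = system_scores(human_scores)
    return kendall_tau([m[s] for s in systems], [h[s] for s in systems])


# Toy data (hypothetical): three systems, three summaries each.
metric = {"sys_a": [0.2, 0.3, 0.3], "sys_b": [0.4, 0.5, 0.6], "sys_c": [0.1, 0.2, 0.2]}
human = {"sys_a": [3, 4, 3], "sys_b": [2, 3, 3], "sys_c": [1, 2, 1]}

tau = system_level_correlation(metric, human)
print(f"system-level Kendall tau: {tau:.3f}")
```

In this toy data the metric and the human judges disagree only on the order of sys_a and sys_b, so two of the three system pairs are concordant and one is discordant.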


“…However, such system-level numbers are not very informative when one is interested in evaluating the absolute performance of inconsistency detection methods that perform a binary decision w.r.t each input. Deutsch et al (2022) also recently discuss various issues in measuring system-level correlations to assess the validity of automatic evaluation metrics for summarization.…”

Section: Meta-evaluation (mentioning)

confidence: 99%

TRUE: Re-evaluating Factual Consistency Evaluation

Honovich, Aharoni, Herzig et al. 2022

Proceedings of the Second DialDoc Workshop on Document-Grounded Dialogue and Conversational Question Answering


Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.
