Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.202
A Study of Automatic Metrics for the Evaluation of Natural Language Explanations

Abstract: As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations. Here, we explore parallels between the generation of such explanations and the much-studied field of evaluation of Natural Language Generation (NLG). Specifically, we investigate which of the NLG evaluation measures map well to explanations. We present the ExBAN corpus: a crowd-sourced corpus of NL explanation…

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1

Citation Types

0
21
0

Year Published

2021
2021
2024
2024

Publication Types

Select...
4
3
1

Relationship

0
8

Authors

Journals

Cited by 26 publications (21 citation statements)
References 58 publications (63 reference statements)
“…Automatic (and human) evaluation processes are well-known problems for the field of Natural Language Generation (Howcroft et al, 2020; Clinciu et al, 2021) and the burgeoning subfield of ST is not immune. ST, in particular, has suffered from a lack of standardization of automatic metrics, a lack of agreement between human judgments and automatic metrics, as well as a blind spot in developing metrics for languages other than English.…”
Section: Discussion
confidence: 99%
“…We only ask Questions 2 and 3 if the answer to Question 1 is "yes" because they concern the new facts, information, or reasoning. We found that most prior work tends to lump added-value, relevance, and adequacy judgements into one "informativeness" judgement (Clinciu et al, 2021), which we felt was too coarse to allow for meaningful error analysis.…”
Section: B2 Absolute Interface Details
confidence: 93%
“…Figures 5 and 6 show the absolute evaluation interface. Our interface is inspired by prior work from psychology and the social sciences (Leake, 1991; Gopnik, 1998; Lombrozo, 2007; Zemla et al, 2017; Chiyah Garcia et al, 2018; Clinciu et al, 2021; Sulik et al, 2021). We iterated over 3-4 versions of the questions and UI design until we had optimized agreement rates as much as possible.…”
Section: B2 Absolute Interface Details
confidence: 99%
“…In contrast, we additionally evaluate commonsense inclusion as well as grammatical correctness of explanations. As Clinciu et al (2021) find automatic BLEURT scores to have distinctly stronger correlations to human ratings of generated textual explanations than BLEU, we investigate whether BLEURT is a viable replacement for a user study.…”
Section: Evaluation and Human Ratings
confidence: 99%
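The comparison the last citing paper draws on, metric scores correlated against human ratings of generated explanations, can be illustrated with a minimal sketch. The snippet below is not the procedure used in Clinciu et al (2021); it only shows one common way to compute sentence-level BLEU (via sacrebleu) and BLEURT (via the google-research/bleurt package) and correlate both with human judgements. The toy explanations, ratings, and the "BLEURT-20" checkpoint path are all illustrative assumptions.

```python
# Sketch: correlating BLEU and BLEURT with human ratings of explanations.
# Assumes `sacrebleu`, `bleurt`, and `scipy` are installed and that a
# BLEURT checkpoint has been downloaded to the directory "BLEURT-20".
from scipy.stats import spearmanr
import sacrebleu
from bleurt import score as bleurt_score

# Hypothetical toy data: generated explanations, reference explanations,
# and a mean human quality rating per generated explanation.
candidates = [
    "the model predicts rain because humidity is high",
    "the loan was refused due to a low credit score",
    "the image shows a cat so the label is cat",
]
references = [
    "humidity is high, so the model forecasts rain",
    "a low credit score caused the loan to be refused",
    "the label is cat because a cat appears in the image",
]
human_ratings = [4.2, 3.8, 4.5]  # illustrative Likert-style means

# Sentence-level BLEU for each candidate against its reference.
bleu_scores = [
    sacrebleu.sentence_bleu(cand, [ref]).score
    for cand, ref in zip(candidates, references)
]

# BLEURT scores from a pretrained checkpoint (path is an assumption).
scorer = bleurt_score.BleurtScorer("BLEURT-20")
bleurt_scores = scorer.score(references=references, candidates=candidates)

# Spearman correlation of each metric with the human ratings; a higher
# |rho| suggests the metric tracks human judgements more closely.
print("BLEU   vs human:", spearmanr(bleu_scores, human_ratings))
print("BLEURT vs human:", spearmanr(bleurt_scores, human_ratings))
```

In practice such correlations are computed over far more items and raters than this toy example; the point is only that the metric replacing a user study should show a consistently strong correlation with the human scores it is meant to stand in for.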