2019
DOI: 10.1162/tacl_a_00293

Inherent Disagreements in Human Textual Inferences

Abstract: We analyze humans’ disagreements about the validity of natural language inferences. We show that, very often, disagreements are not dismissible as annotation “noise”, but rather persist as we collect more ratings and as we vary the amount of context provided to raters. We further show that the type of uncertainty captured by current state-of-the-art models for natural language inference is not reflective of the type of uncertainty present in human disagreements. We discuss implications of our results in relati…

Cited by 157 publications (186 citation statements)
References 36 publications
“…In this case, a piece of evidence contradicts a relative clause in the claim but does not refute the entire claim. Similar problems regarding the uncertainty of NLI tasks have been pointed out in previous works (Zaenen et al., 2005; Pavlick and Kwiatkowski, 2019; Chen et al., 2020a).…”
Section: Claim Labeling (supporting)
confidence: 76%
“…However, we find that the decision between REFUTED and NOTENOUGHINFO can be ambiguous in many-hop claims, and even the high-quality, trained annotators from Appen (rather than MTurk) cannot consistently choose the correct label from these two classes. Recent works (Pavlick and Kwiatkowski, 2019; Chen et al., 2020a) have raised concerns over the uncertainty of NLI tasks with categorical labels and proposed to shift to a probabilistic scale. Since this work mainly targets many-hop retrieval, we combine REFUTED and NOTENOUGHINFO into a single class, namely NOT-SUPPORTED.…”
Section: Introduction (mentioning)
confidence: 99%
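The label merge described in the snippet above is a simple relabeling step. A minimal sketch of that collapse is given below; the function name is hypothetical, the REFUTED/NOTENOUGHINFO/NOT-SUPPORTED strings come from the quote, and the SUPPORTED label is assumed from the standard fact-verification label set.

```python
# Illustrative sketch (not the cited paper's code): collapsing three-way claim
# labels into the binary scheme described in the quote above.
COLLAPSE = {
    "SUPPORTED": "SUPPORTED",
    "REFUTED": "NOT-SUPPORTED",
    "NOTENOUGHINFO": "NOT-SUPPORTED",
}

def collapse_label(label: str) -> str:
    """Map a three-way claim label onto the coarser SUPPORTED / NOT-SUPPORTED scheme."""
    return COLLAPSE[label]

assert collapse_label("REFUTED") == collapse_label("NOTENOUGHINFO") == "NOT-SUPPORTED"
```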
“…In their study, annotators had to select the degree to which a premise entails a hypothesis on a scale (Chen et al., 2020), instead of choosing among discrete labels. Pavlick and Kwiatkowski (2019) show that even though these datasets are reported to have high agreement scores, specific examples suffer from inherent disagreements. For instance, in about 20% of the inspected examples, "there is a nontrivial second component" (e.g.…”
Section: Results (mentioning)
confidence: 92%
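As a rough illustration of what scale-based ratings make visible that a single discrete label hides, the sketch below aggregates many per-item ratings and flags items where a sizable minority of raters lands on the opposite side of the scale. The scale range, the 20% threshold (echoing the figure quoted above), and all names are illustrative assumptions, not the method of either cited paper.

```python
import statistics

def summarize_item(ratings: list[float], midpoint: float = 0.0) -> dict:
    """Summarize many slider ratings for one premise/hypothesis pair."""
    above = sum(r > midpoint for r in ratings)
    below = len(ratings) - above
    minority_share = min(above, below) / len(ratings)
    return {
        "mean": statistics.fmean(ratings),
        "stdev": statistics.stdev(ratings) if len(ratings) > 1 else 0.0,
        # Flag items where a sizable minority sits on the other side of the scale,
        # i.e., disagreement that extra ratings are unlikely to wash out as noise.
        "nontrivial_second_component": minority_share >= 0.2,
    }

# Example: most raters judge the pair as entailed, but a persistent minority does not.
print(summarize_item([40, 45, 38, -30, -35, 42, 41, -28, 39, 44]))
```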
“…Our findings are related to theirs, although not identical: while the disagreements they report are due to the individuals' interpretations of a situation, in our case, disagreements are due to the difficulty in imagining a different scenario. While some works propose to collect annotator disagreements and use them as inputs (Plank et al., 2014; Palomaki et al., 2018) (see Pavlick and Kwiatkowski (2019) for an elaborated overview), this will not hold in our case, because only one of the labels is typically correct.…”
Section: Results (mentioning)
confidence: 97%
“…Next, disagreements are resolved through a follow-up adjudication process. Following Pavlick and Kwiatkowski (2019), we surface any inherent ambiguity/disagreement between annotators in the final set of labels. Even after the adjudication process, if raters fail to resolve (say, rater 1 sticks to attribute a while raters 2 and 3 stick to attribute b), we propagate {a, b} as the final label (accounting for 2.4% of the non-singleton labels).…”
Section: Characterizing the Annotated Data (mentioning)
confidence: 99%
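The propagation rule in the quote above can be read as: keep whatever attributes raters still hold after adjudication, so unresolved disagreement surfaces as a label set rather than a forced single choice. A minimal sketch under that reading follows; the function name and example attributes are hypothetical, not the authors' implementation.

```python
# Illustrative sketch of the post-adjudication rule quoted above.
def final_label(post_adjudication_votes: list[str]) -> frozenset[str]:
    """Return the set of attributes raters still stick to after adjudication."""
    return frozenset(post_adjudication_votes)

print(final_label(["a", "a", "a"]))  # frozenset({'a'}): agreement, singleton label
print(final_label(["a", "b", "b"]))  # frozenset({'a', 'b'}): unresolved, both propagated
```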