Embracing Ambiguity: Shifting the Training Target of NLI Models

Meissner, Johannes Mario; Thumwanit, Napat; Sugawara, Saku; Aizawa, Akiko

doi:10.18653/v1/2021.acl-short.109

Cited by 8 publications

(10 citation statements)

References 8 publications

“…Distributional models Distributional models aim to predict the distribution of annotator judgments. We use two models from prior work: 1) one trained on AmbiNLI (Meissner et al, 2021), with examples with multiple annotations from SNLI (Bowman et al, 2015) and MNLI, and 2) ing distributional labels into discrete ones with a threshold of 0.2. In addition, we train a multilabel model on WANLI's train set (which has two annotations per example), as well as a classifier over sets which performs 7-way classification over the power set of NLI labels, minus the empty set.…”

Section: Regression Models (mentioning)

confidence: 99%
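
The thresholding step and the 7-way label space described in the statement above can be sketched in Python. This is a minimal illustration, not the cited authors' code: the names `LABELS`, `threshold_labels`, and `nonempty_label_sets` are hypothetical, and treating the 0.2 threshold as inclusive is an assumption the excerpt does not settle.

```python
from itertools import combinations

# The three standard NLI labels (the order here is arbitrary).
LABELS = ("entailment", "neutral", "contradiction")


def threshold_labels(dist, threshold=0.2):
    """Turn a label distribution into a discrete label set.

    Keeps every label whose probability is at least `threshold`.
    Using >= rather than > is an assumption.
    """
    return frozenset(label for label in LABELS if dist[label] >= threshold)


def nonempty_label_sets(labels=LABELS):
    """Enumerate the power set of the labels minus the empty set (2**3 - 1 = 7)."""
    return [
        frozenset(subset)
        for size in range(1, len(labels) + 1)
        for subset in combinations(labels, size)
    ]


example = {"entailment": 0.55, "neutral": 0.35, "contradiction": 0.10}
predicted_set = threshold_labels(example)
classes = nonempty_label_sets()
```

With the example distribution, `predicted_set` contains entailment and neutral, and `classes` holds the seven candidate label sets that a classifier over sets would choose among.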

“…The AmbiNLI model (Meissner et al, 2021) is first pretrained on single-label data from SNLI + MNLI for 3 epochs, then further finetuned on AmbiNLI for 2 epochs. AmbiNLI examples have distributional outputs, and is sourced from the development set of SNLI and MNLI (which contain 5 labels) and train set of UNLI (which are heuristically mapped to soft labels).…”

Section: D2 Training Details (mentioning)

confidence: 99%

We’re Afraid Language Models Aren’t Modeling Ambiguity

Liu, Wu, Michael et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Ambiguity is an intrinsic feature of natural language. Managing ambiguity is a key part of human language understanding, allowing us to anticipate misunderstanding as communicators and revise our interpretations as listeners. As language models are increasingly employed as dialogue interfaces and writing aids, handling ambiguous language is critical to their success. We capture ambiguity in a sentence through its effect on entailment relations with another sentence, and collect AMBIENT, a linguist-annotated benchmark of 1,645 examples with diverse kinds of ambiguity. We design a suite of tests based on AMBIENT, presenting the first evaluation of pretrained LMs to recognize ambiguity and disentangle possible meanings. We find that the task remains extremely challenging, including for GPT-4, whose generated disambiguations are considered correct only 32% of the time in crowdworker evaluation, compared to 90% for disambiguations in our dataset. Finally, to illustrate the value of ambiguity-sensitive tools, we show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity. We encourage the field to rediscover the importance of ambiguity for NLP.

“…However, we may not simply attribute such disagreement to poor annotation quality, since there is inherent ambiguity in the annotations of natural language inference tasks, as is reported by Nie et al (2020). We can still make reasonable probabilistic estimation of the status by embracing the ambiguity and directly learn from the annotation distribution (Meissner et al, 2021). Therefore, we change the learning target from binary labels to the portion of annotators who label the status as uncertain, and the possible values are thus 0, 1/3, 2/3 and 1.…”

Section: Symptom Status Inference (mentioning)

confidence: 99%
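
The learning target described in the statement above, the portion of annotators who mark a status as uncertain, can be illustrated with a short Python sketch. The function names `uncertain_fraction` and `possible_targets` are hypothetical, and the use of three annotators per item is inferred from the listed values 0, 1/3, 2/3 and 1 rather than stated in the excerpt.

```python
from fractions import Fraction


def uncertain_fraction(votes):
    """Return the share of annotators who labeled the status as uncertain.

    `votes` is a list of booleans, one per annotator (True = uncertain).
    """
    if not votes:
        raise ValueError("at least one annotation is required")
    return Fraction(sum(1 for vote in votes if vote), len(votes))


def possible_targets(n_annotators=3):
    """List every value the target can take with `n_annotators` annotators."""
    return [Fraction(k, n_annotators) for k in range(n_annotators + 1)]


# One of three (assumed) annotators marked the status as uncertain.
target = uncertain_fraction([True, False, False])
targets = possible_targets()
```

Exact fractions are used here only to make the four possible target values easy to read; a training pipeline would more likely convert them to floats.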

Symptom Identification for Interpretable Detection of Multiple Mental Disorders

Zhang, Chen, Wu et al. 2022

Preprint

Mental disease detection (MDD) from social media has suffered from poor generalizability and interpretability, due to lack of symptom modeling. This paper introduces PsySym, the first annotated symptom identification corpus of multiple psychiatric disorders, to facilitate further research progress. PsySym is annotated according to a knowledge graph of the 38 symptom classes related to 7 mental diseases compiled from established clinical manuals and scales, and a novel annotation framework for diversity and quality. Experiments show that symptom-assisted MDD enabled by PsySym can outperform strong pure-text baselines. We also exhibit the convincing MDD explanations provided by symptom predictions with case studies, and point to their further potential applications. Code and dataset can be provided upon request.

“…However, the tasks concerning conditions under multiple plausible scenarios are few, and their domains are limited to, for example, factual information that differs according to place and time (Zhang and Choi, 2021) or human behaviors that are either normative or divergent (Emelin et al, 2021). Another example is the natural language inference or commonsense reasoning task that considers variations in human opinions (Zhang et al, 2017;Chen et al, 2020b), which allows for the differences in annotations due to one's mentality (Pavlick and Kwiatkowski, 2019;Meissner et al, 2021). Our aim here is to interrogate these types of situated reasoning in more comprehensive settings, such as in story texts.…”

Section: Original Ending (mentioning)

confidence: 99%

Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios

Ashida, Sugawara 2022

Preprint

Self Cite

The possible consequences for the same context may vary depending on the situation we refer to. However, current studies in natural language processing do not focus on situated commonsense reasoning under multiple possible scenarios. This study frames this task by asking multiple questions with the same set of possible endings as candidate answers, given a short story text. Our resulting dataset, Possible Stories, consists of more than 4.5K questions over 1.3K story texts in English. We discover that even current strong pretrained language models struggle to answer the questions consistently, highlighting that the highest accuracy in an unsupervised setting (60.2%) is far behind human accuracy (92.5%). Through a comparison with existing datasets, we observe that the questions in our dataset contain minimal annotation artifacts in the answer options. In addition, our dataset includes examples that require counterfactual reasoning, as well as those requiring readers' reactions and fictional information, suggesting that our dataset can serve as a challenging testbed for future studies on situated commonsense reasoning.
