An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference

Li, Tianyu; Zheng, Xin; Ding, Xiaoan; Chang, Baobao; Sui, Zhifang

doi:10.18653/v1/2020.conll-1.48

Cited by 15 publications

(22 citation statements)

References 43 publications

“…Biases in NLI Table 5: Results on the NLI adversarial test benchmark (Liu et al, 2020b). We compare with the data augmentation techniques investigated by Liu et al (2020b). * are reported results and underscore indicates statistical significance against the baseline.…”

Section: Adversarial Tests For Combating Distinct (mentioning)

confidence: 99%

“…To evaluate our approach, we use the task of Natural Language Inference (NLI), which offers a wide range of datasets (including challenge datasets) for various domains. We generate debiased SNLI (Bowman et al, 2015) and MNLI (Williams et al, 2018) distributions and evaluate the generalisability of models trained on them to out-of-distribution hard evaluation sets (Gururangan et al, 2018; McCoy et al, 2019), and the adversarial attack suite for NLI proposed by Liu et al (2020b). Furthermore, we compare our method to strong debiasing strategies from the literature (Belinkov et al, 2019b; Stacey et al, 2020; Clark et al, 2019; Karimi Mahabadi et al, 2020; Utama et al, 2020; Sanh et al, 2021; Ghaddar et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets

Wu, Gardner, Stenetorp et al. 2022

Preprint

Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on, while not generalising to different task distributions. We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model, by simply replacing its training data. Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations, measured in terms of z-statistics. We generate debiased versions of the SNLI and MNLI datasets, and we evaluate on a large suite of debiased, out-of-distribution, and adversarial test sets. Results show that models trained on our debiased datasets generalise better than those trained on the original datasets in all settings. On the majority of the datasets, our method outperforms or performs comparably to previous state-of-the-art debiasing strategies, and when combined with an orthogonal technique, product-of-experts, it improves further and outperforms previous best results of SNLI-hard and MNLI-hard.
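The abstract mentions filtering data points by z-statistics but does not spell out the computation. The following is a minimal sketch of one plausible reading, assuming a one-proportion z-test of p(label | feature) against a uniform label prior and using unigrams as features. The function names `z_statistics` and `filter_biased`, the choice of unigram features, and the threshold-based filter are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def z_statistics(examples, num_labels):
    """Score each (token, label) pair with a one-proportion z-statistic.

    examples: list of (tokens, label) pairs.
    Assumption: the null hypothesis is a uniform label prior, p0 = 1 / num_labels.
    """
    p0 = 1.0 / num_labels
    feature_counts = Counter()  # number of examples containing the token
    joint_counts = Counter()    # number of examples containing the token with a given label
    for tokens, label in examples:
        for tok in set(tokens):
            feature_counts[tok] += 1
            joint_counts[(tok, label)] += 1

    scores = {}
    for (tok, label), k in joint_counts.items():
        n = feature_counts[tok]
        p_hat = k / n
        scores[(tok, label)] = (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n)
    return scores


def filter_biased(examples, num_labels, threshold):
    """Keep only examples whose tokens all score at or below the threshold
    for the example's own label (a simplified, hypothetical filter)."""
    scores = z_statistics(examples, num_labels)
    return [
        (tokens, label)
        for tokens, label in examples
        if all(scores[(tok, label)] <= threshold for tok in set(tokens))
    ]
```

In this sketch a token that always co-occurs with one label receives a large positive score, so examples carrying it are dropped first; the threshold is a free parameter that would need tuning on real data.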

“…The trigger bias proposed in our paper belongs to selection bias and model overamplification bias. Bias has also been investigated in natural language inference [1,6,7,13,21-23], question answering [24], ROC story cloze [2,28], lexical inference [17], visual question answering [12], etc. To our best knowledge, we are the first to present the biases in FSEC, i.e., trigger overlapping and trigger separability.…”

Section: Few-shot Event Classification (mentioning)

confidence: 99%

Behind the Scenes: An Exploration of Trigger Biases Problem in Few-Shot Event Classification

Wang, Xu, Li et al. 2021

Preprint

Self Cite

Few-Shot Event Classification (FSEC) aims at developing a model for event prediction, which can generalize to new event types with a limited number of annotated data. Existing FSEC studies have achieved high accuracy on different benchmarks. However, we find they suffer from trigger biases that signify the statistical homogeneity between some trigger words and target event types, which we summarize as trigger overlapping and trigger separability. The biases can result in context-bypassing problem, i.e., correct classifications can be gained by looking at only the trigger words while ignoring the entire context. Therefore, existing models can be weak in generalizing to unseen data in real scenarios. To further uncover the trigger biases and assess the generalization ability of the models, we propose two new sampling methods, Trigger-Uniform Sampling (TUS) and COnfusion Sampling (COS), for the meta tasks construction during evaluation. Besides, to cope with the context-bypassing problem in FSEC models, we introduce adversarial training and trigger reconstruction techniques. Experiments show these techniques help not only improve the performance, but also enhance the generalization ability of models. Our data and code is

“…Such methods can be roughly categorized into two classes: sentence embedding bottleneck methods which first encode the two sentences as vectors and then feed them into a classifier for classification (Conneau et al, 2017; Nie and Bansal, 2017; Choi et al, 2018; Chen et al, 2017b; Wu et al, 2018), and more general methods which usually involve interactions while encoding the two sentences in the pair (Chen et al, 2017a; Gong et al, 2018; Parikh et al, 2016). Recently, NLI models are shown to be biased towards spurious surface patterns in the human annotated datasets (Poliak et al, 2018; Gururangan et al, 2018; Liu et al, 2020a), which makes them vulnerable to adversarial attacks (Glockner et al, 2018; Minervini and Riedel, 2018; McCoy et al, 2019; Liu et al, 2020b).…”

Section: Natural Language Inference (mentioning)

confidence: 99%

Discriminatively-Tuned Generative Classifiers for Robust Natural Language Inference

Ding, Chang et al. 2020

Preprint

Self Cite

While discriminative neural network classifiers are generally preferred, recent work has shown advantages of generative classifiers in terms of data efficiency and robustness. In this paper, we focus on natural language inference (NLI). We propose GenNLI, a generative classifier for NLI tasks, and empirically characterize its performance by comparing it to five baselines, including discriminative models and large-scale pretrained language representation models like BERT. We explore training objectives for discriminative fine-tuning of our generative classifiers, showing improvements over log loss fine-tuning from prior work (Lewis and Fan, 2019). In particular, we find strong results with a simple unbounded modification to log loss, which we call the "infinilog loss". Our experiments show that GenNLI outperforms both discriminative and pretrained baselines across several challenging NLI experimental settings, including small training sets, imbalanced label distributions, and label noise.
