Pretrained language models such as BERT and RoBERTa have shown large improvements on the commonsense reasoning benchmark COPA. However, recent work found that many improvements on natural language understanding benchmarks are not due to models learning the task, but due to their increasing ability to exploit superficial cues, such as tokens that occur more often in the correct answer than in the wrong one. Is BERT's and RoBERTa's good performance on COPA also caused by this? We find superficial cues in COPA, as well as evidence that BERT exploits these cues. To remedy this problem, we introduce Balanced COPA, an extension of COPA that does not suffer from easy-to-exploit single-token cues. We analyze BERT's and RoBERTa's performance on original and Balanced COPA, finding that BERT relies on superficial cues when they are present, but still achieves comparable performance once they are made ineffective, suggesting that BERT learns the task to a certain degree when forced to. In contrast, RoBERTa does not appear to rely on superficial cues.
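As an illustration of the kind of single-token cue described above, the following minimal sketch counts how often each token appears in correct versus wrong alternatives of a COPA-style dataset. The toy data, frequency threshold, and scoring are illustrative assumptions, not the exact cue statistics used in the paper.

```python
from collections import Counter

def find_token_cues(pairs):
    """Count how often each token appears in correct vs. wrong alternatives.

    `pairs` is an iterable of (correct_alternative, wrong_alternative) strings.
    Returns tokens sorted by how strongly they skew toward correct answers.
    """
    correct_counts, wrong_counts = Counter(), Counter()
    for correct, wrong in pairs:
        correct_counts.update(correct.lower().split())
        wrong_counts.update(wrong.lower().split())

    cues = []
    for token in set(correct_counts) | set(wrong_counts):
        c, w = correct_counts[token], wrong_counts[token]
        if c + w >= 5:                      # ignore very rare tokens
            cues.append((token, c / (c + w), c + w))
    return sorted(cues, key=lambda x: x[1], reverse=True)

# Toy example: a token that appears mostly in correct alternatives
# would surface near the top of this list.
toy_pairs = [
    ("the man went to a doctor", "the man went to the park"),
    ("she bought a new car", "she sold her old bike"),
    ("he took a deep breath", "he held his breath"),
    ("they found a solution", "they ignored the problem"),
    ("we need a plan", "we have no time"),
]
for token, skew, freq in find_token_cues(toy_pairs)[:5]:
    print(f"{token!r}: appears in correct alternatives {skew:.0%} of the time (n={freq})")
```

A token whose skew is far from 50% and whose frequency is non-trivial is a candidate superficial cue: a model can raise its accuracy by attending to that token alone, without reasoning about the premise.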
Finetuning large pre-trained language models with a task-specific head has advanced the state-of-the-art on many natural language understanding benchmarks. However, models with a task-specific head require a lot of training data, making them susceptible to learning and exploiting dataset-specific superficial cues that do not generalize to other datasets. Prompting has reduced the data requirement by reusing the language model head and formatting the task input to match the pre-training objective. One might therefore expect that few-shot prompt-based models do not exploit superficial cues. This paper presents an empirical examination of whether few-shot prompt-based models also exploit superficial cues. Analyzing few-shot prompt-based models on MNLI, SNLI, HANS, and COPA reveals that prompt-based models also exploit superficial cues. While the models perform well on instances with superficial cues, they often underperform or only marginally outperform random accuracy on instances without superficial cues.
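The sketch below illustrates the prompting setup described above on an NLI example: the task input is rewritten as a cloze question so the masked language model head can score label words directly, with no task-specific head. The model name (roberta-base), template wording, and verbalizers are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal zero-shot cloze-style prompting sketch for NLI with a masked LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "roberta-base"                      # assumed model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Format the instance to look like the pre-training (cloze) objective.
text = f"{premise} {tokenizer.mask_token}, {hypothesis}"
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Map label words back to NLI labels via the LM head (no task-specific head).
verbalizers = {"entailment": " Yes", "contradiction": " No", "neutral": " Maybe"}
for label, word in verbalizers.items():
    token_id = tokenizer.encode(word, add_special_tokens=False)[0]
    print(f"{label:13s} score: {logits[token_id].item():.2f}")
```

In the few-shot setting, the same template is kept and the model is finetuned on a handful of labeled examples; the question raised above is whether even this lightweight adaptation is enough for the model to latch onto superficial cues.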
Improving model generalization on held-out data is one of the core objectives in commonsense reasoning. Recent work has shown that models trained on datasets with superficial cues tend to perform well on easy test instances that contain superficial cues but poorly on hard test instances that lack them. Previous approaches have resorted to manual methods of discouraging models from overfitting to superficial cues. While some of these methods improve performance on hard instances, they also degrade performance on easy instances. Here, we propose to explicitly learn a model that does well on both the easy test set with superficial cues and the hard test set without superficial cues. Using a meta-learning objective, we learn such a model and improve performance on both the easy and the hard test set. Evaluating our models on Choice of Plausible Alternatives (COPA) and Commonsense Explanation, we show that our proposed method improves performance on both the easy and the hard test set, with gains of up to 16.5 percentage points over the baseline.
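As a rough illustration of training a model to do well on both kinds of instances, the sketch below takes a first-order meta-learning-style step: adapt on an "easy" batch, then require the adapted parameters to also fit a "hard" batch. The toy data, linear classifier, and loss combination are illustrative assumptions and not the paper's actual objective or architecture.

```python
# Toy meta-learning-style update balancing easy (with cues) and hard (without cues) batches.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 2)                   # stand-in classifier
outer_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
inner_lr = 1e-1

def make_batch(n=8, d=16):
    # Stand-in for real easy/hard training batches.
    return torch.randn(n, d), torch.randint(0, 2, (n,))

for step in range(100):
    x_easy, y_easy = make_batch()                # instances with superficial cues
    x_hard, y_hard = make_batch()                # instances without superficial cues

    # Inner step: adapt a copy of the weights on the easy batch.
    easy_loss = F.cross_entropy(model(x_easy), y_easy)
    grad_w, = torch.autograd.grad(easy_loss, model.weight, create_graph=True)
    fast_weight = model.weight - inner_lr * grad_w

    # Outer step: the adapted weights must also fit the hard batch,
    # so the update cannot rely on cues that only help easy instances.
    hard_loss = F.cross_entropy(F.linear(x_hard, fast_weight, model.bias), y_hard)
    loss = easy_loss + hard_loss

    outer_opt.zero_grad()
    loss.backward()
    outer_opt.step()
```

The intent of such an objective is that progress on cue-bearing instances only counts if it also transfers to instances where the cue is absent, which is the easy/hard trade-off the abstract targets.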
Pretrained language models have achieved remarkable performance on most natural language benchmarks. Until recently, the dominant approach to adapting these models to downstream tasks has been finetuning a task-specific head. However, previous work has found that these models learn to exploit spurious correlations between inputs and labels (Gururangan et al. 2018; Poliak et al. 2018; Kavumba et al. 2019). These spurious correlations may exist in the form of unique input tokens, style, or annotation artifacts, which all fall under the wide umbrella of superficial cues (Kavumba et al. 2019). While superficial cues are strong predictors of the labels, they have nothing to do with the intended task. In software development terms, superficial cues can be viewed as bugs in the task design: they allow the task to be solved in unintended ways, so the dataset becomes a poor benchmark of the required ability or abilities. In causal terms, we can view superficial cues as confounding variables. Unfortunately, superficial cues tend to be unique to a single dataset. Thus, a model's remarkable performance on one dataset does not transfer to other datasets, i.e., the models tend not to be robust. In order to develop models that are truly robust, we need to identify whether the models are exploiting superficial cues. This is the first step in debugging both the dataset and the model. Only then can we train robust models that learn to be right for the right reasons (Kavumba et al. 2021).

Recently, prompting has been shown to achieve performance competitive with models finetuned using a task-specific head, despite using only a few task-specific finetuning examples. Given how little data prompt-based models require to achieve strong performance on downstream tasks, do they also learn to exploit superficial cues? This is the research question explored in Kavumba et al. (2022). We presented a rigorous empirical investigation of whether prompt-based models also exploit superficial cues, examining them on two fundamental English-language natural language understanding tasks: natural language inference (NLI) and the Choice of Plausible Alternatives (COPA).