Errudite: Scalable, Reproducible, and Testable Error Analysis

Wu, Tongshuang; Ribeiro, Marco Túlio; Heer, Jeffrey; Weld, Daniel S.

doi:10.18653/v1/p19-1073

Cited by 104 publications

(62 citation statements)

References 32 publications


“…A similar analysis on DROP shows that MTMSN does substantially worse on event re-ordering (47.3 F1) than on adding compositional reasoning steps (67.5 F1). We recommend authors categorize their perturbations up front in order to simplify future analyses and bypass some of the pitfalls of post-hoc error categorization (Wu et al., 2019). Additionally, it's worth discussing the dependency parsing result.…”

Section: Fine-grained Analysis Of Contrast Sets (mentioning)

confidence: 99%

Evaluating Models’ Local Decision Boundaries via Contrast Sets

Gardner, Artzi, Basmov, et al. 2020

Findings of the Association for Computational Linguistics: EMNLP 2020


Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, by up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
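The evaluation idea in the abstract above can be illustrated with a minimal sketch. It assumes a contrast set is one original test instance plus its manually perturbed variants, and it reports accuracy on the originals, accuracy on the perturbed instances, and the share of sets the model gets entirely right. The names `ContrastSet`, `evaluate_contrast_sets`, and `toy_predict`, and the toy data, are hypothetical and are not taken from the paper's released code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


@dataclass
class ContrastSet:
    """One original test instance plus its manually perturbed variants."""
    original: Example
    perturbed: List[Example]


def accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        return 0.0
    return sum(predict(text) == gold for text, gold in examples) / len(examples)


def evaluate_contrast_sets(
    predict: Callable[[str], str], sets: List[ContrastSet]
) -> Dict[str, float]:
    """Compare performance on originals with performance on perturbations."""
    originals = [cs.original for cs in sets]
    perturbed = [ex for cs in sets for ex in cs.perturbed]
    # A set counts as consistent only if every instance in it is correct.
    consistent = sum(
        all(predict(text) == gold for text, gold in [cs.original, *cs.perturbed])
        for cs in sets
    )
    return {
        "original_accuracy": accuracy(predict, originals),
        "contrast_accuracy": accuracy(predict, perturbed),
        "consistency": consistent / len(sets) if sets else 0.0,
    }


def toy_predict(text: str) -> str:
    """Hypothetical keyword classifier standing in for a real model."""
    return "pos" if "good" in text else "neg"


toy_sets = [
    ContrastSet(("a good film", "pos"), [("not a good film", "neg")]),
    ContrastSet(("a dull film", "neg"), [("a good film after all", "pos")]),
]
report = evaluate_contrast_sets(toy_predict, toy_sets)
print(report)
```

In this toy setup the keyword rule is perfect on the original instances but fails on the negated perturbation, which is the kind of gap between in-distribution accuracy and local decision-boundary behavior that the abstract describes.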


“…The grouping ensures that we do not mistakenly prioritize groups that are actually well-handled on average. We follow the approach proposed by Wu et al. (2019), and extend their Errudite framework to the relation extraction task. After formulating a hypothesis, we assess the error prevalence over the entire dataset split to validate whether the hypothesis holds, i.e.…”

Section: Error Hypotheses Formulation and Adversarial Rewriting (mentioning)

confidence: 99%

“…in intermediate layers. More similar to our approach is rewriting of instances (Jia and Liang, 2017; Ribeiro et al., 2018) but instead of evaluating model robustness we use rewriting to test explicit error hypotheses, similar to Wu et al. (2019).…”

Section: Related Work (mentioning)

confidence: 99%

“…To answer the second question, we carry out two analyses: (1) we conduct a manual explorative analysis of model misclassifications on the most challenging test instances and categorize them into several linguistically motivated error categories; (2) we formulate these categories into testable hypotheses, which we can automatically validate on the full test set by adversarial rewriting - removing the suspected cause of error and observing the change in model prediction (Wu et al., 2019). We find that two groups of ambiguous relations are responsible for most of the remaining errors.…”

Section: Introduction (mentioning)

confidence: 99%


TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task

Alt, Gabryszak, Hennig 2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics


TACRED (Zhang et al., 2017) is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE). But, even with recent advances in unsupervised pretraining and knowledge enhanced neural RE, models still show a high error rate. In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement? And how do crowd annotations, dataset, and models contribute to this error rate? To answer these questions, we first validate the most challenging 5K examples in the development and test sets using trained annotators. We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled. On the relabeled test set the average F1 score of a large baseline model set improves from 62.1 to 70.1. After validation, we analyze misclassifications on the challenging instances, categorize them into linguistically motivated error groups, and verify the resulting error hypotheses on three state-of-the-art RE models. We show that two groups of ambiguous relations are responsible for most of the remaining errors and that models may adopt shallow heuristics on the dataset when entities are not masked.
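The hypothesis-testing procedure described in the citation statements above (group the instances a hypothesis covers, measure how prevalent errors are in that group, then rewrite the instances to remove the suspected cause and observe whether predictions change) can be sketched roughly as follows. This is an illustration under stated assumptions, not Errudite's API or the TACRED Revisited code: `Instance`, `check_hypothesis`, the toy model, and the "although" distractor hypothesis are all hypothetical, and the rewrite is assumed to leave the gold label unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instance:
    text: str
    gold: str


def check_hypothesis(
    predict: Callable[[str], str],
    instances: List[Instance],
    in_group: Callable[[str], bool],
    rewrite: Callable[[str], str],
) -> Dict[str, float]:
    """Measure error prevalence in a group and how many errors a rewrite fixes."""
    group = [x for x in instances if in_group(x.text)]
    all_errors = [x for x in instances if predict(x.text) != x.gold]
    group_errors = [x for x in group if predict(x.text) != x.gold]
    # Rewriting removes the suspected cause; the gold label is assumed unchanged.
    fixed = [x for x in group_errors if predict(rewrite(x.text)) == x.gold]
    return {
        "group_size": float(len(group)),
        "overall_error_rate": len(all_errors) / len(instances) if instances else 0.0,
        "group_error_rate": len(group_errors) / len(group) if group else 0.0,
        "fixed_by_rewrite": len(fixed) / len(group_errors) if group_errors else 0.0,
    }


def toy_predict(text: str) -> str:
    """Hypothetical model that keys on the word 'bad' anywhere in the input."""
    return "neg" if "bad" in text else "pos"


toy_data = [
    Instance("great plot, although the trailer looked bad", "pos"),
    Instance("bad acting, although the score was nice", "neg"),
    Instance("bad pacing", "neg"),
    Instance("lovely film", "pos"),
    Instance("not good", "neg"),
]

# Hypothesis: a trailing "although ..." clause distracts the model.
result = check_hypothesis(
    toy_predict,
    toy_data,
    in_group=lambda t: "although" in t,
    rewrite=lambda t: t.split(", although")[0],
)
print(result)
```

Comparing the group error rate with the overall error rate guards against prioritizing a group that is handled well on average, and the share of errors fixed by the rewrite gives evidence for or against the suspected cause.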


“…In contrast to those two tools, our tool offers visualizations of a variety of statistically interesting aspects of data splits in order to better understand model behaviours. Wu et al. (2019) provide an interactive tool for error analysis called ERRUDITE. It supports, i.a., automated counterfactual rewriting for testing hypotheses about errors.…”

Section: Tools For Analyzing NLP Models (mentioning)

confidence: 99%

ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation

Wecker, Friedrich, Adel 2020

Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems


This paper adds to the ongoing discussion in the natural language processing community on how to choose a good development set. Motivated by the real-life necessity of applying machine learning models to different data distributions, we propose a clustering-based data splitting algorithm. It creates development (or test) sets which are lexically different from the training data while ensuring similar label distributions. Hence, we are able to create challenging cross-validation evaluation setups while abstracting away from performance differences resulting from label distribution shifts between training and test data. In addition, we present a Python-based tool for analyzing and visualizing data split characteristics and model performance. We illustrate the workings and results of our approach using a sentiment analysis and a patent classification task.

