BLiMP: The Benchmark of Linguistic Minimal Pairs for English

Warstadt, Alex; Parrish, Alicia; Liu, Haokun; Mohananey, Anhad; Peng, Wei; Wang, Sheng Fu; Bowman, Samuel R.

doi:10.1162/tacl_a_00321

Cited by 219 publications

(299 citation statements)

References 37 publications

Supporting

Mentioning: 298

Contrasting


“…Given the differences in pre-training strategies, we speculate that pre-training with more data might benefit model robustness against noised data. This speculation is consistent with (Warstadt et al, 2019b), where the authors also give a lightweight demonstration on LSTM and Transformer-XL (Dai et al, 2019) with varying training data. We leave a further exploration of this speculation and a detailed analysis of model architecture to future work.…”

Section: How Grammatical Errors Affect Downstream Performance? (supporting)

confidence: 82%

“…In contrast, we propose a method to cover a broader range of grammatical errors and evaluate on downstream tasks. A concurrent work (Warstadt et al, 2019b) facilitates diagnosing language models by creating linguistic minimal pairs datasets for 67 isolate grammatical paradigms in English using linguist-crafted templates. In contrast, we do not rely heavily on artificial vocabulary and templates.…”

Section: Related Work (mentioning)

confidence: 99%


On the Robustness of Language Encoders against Grammatical Errors

Yin, Long, Meng et al. 2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics


We conduct a thorough study to diagnose the behaviors of pre-trained language encoders (ELMo, BERT, and RoBERTa) when confronted with natural grammatical errors. Specifically, we collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. We use this approach to facilitate debugging models on downstream applications. Results confirm that the performance of all tested models is affected but the degree of impact varies. To interpret model behaviors, we further design a linguistic acceptability task to reveal their abilities in identifying ungrammatical sentences and the position of errors. We find that fixed contextual encoders with a simple classifier trained on the prediction of sentence correctness are able to locate error positions. We also design a cloze test for BERT and discover that BERT captures the interaction between errors and specific tokens in context. Our results shed light on understanding the robustness and behaviors of language encoders against grammatical errors.


“…Challenge Sets The idea of creating challenging contrastive evaluation sets has a long history (Levesque et al, 2011; Ettinger et al, 2017; Glockner et al, 2018; Naik et al, 2018; Isabelle et al, 2017). Challenge sets exist for various phenomena, including ones with "minimal" edits similar to our contrast sets, e.g., in image captioning (Shekhar et al, 2017), machine translation (Sennrich, 2017; Burlot and Yvon, 2017; Burlot et al, 2018), and language modeling (Marvin and Linzen, 2018; Warstadt et al, 2019). Minimal pairs of edits that perturb gender or racial attributes are also useful for evaluating social biases (Zhao et al, 2018; Lu et al, 2018).…”

Section: Training On Perturbed Examples Many Previous Work Have Provided Minimally Contrastive (mentioning)

confidence: 99%

Evaluating Models’ Local Decision Boundaries via Contrast Sets

Gardner, Artzi, Basmov et al. 2020

Findings of the Association for Computational Linguistics: EMNLP 2020



Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
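As a rough illustration of the idea in this abstract, the sketch below compares a model's accuracy on a few original test instances against its accuracy on manually perturbed versions whose gold label flips. The classifier `keyword_sentiment`, the helper `accuracy`, and all example sentences are hypothetical and invented for illustration; they are not taken from the paper or its released contrast sets.

```python
def keyword_sentiment(text):
    """Hypothetical shallow classifier: 'positive' iff the word 'great' appears."""
    return "positive" if "great" in text.lower().split() else "negative"


def accuracy(examples, predict):
    """Fraction of (text, gold_label) examples that predict() labels correctly."""
    return sum(predict(text) == label for text, label in examples) / len(examples)


# Invented original test instances that the shallow rule happens to solve.
original_set = [
    ("a great film", "positive"),
    ("a dull film", "negative"),
]

# Invented small perturbations of the instances above that flip the gold label.
contrast_set = [
    ("hardly a great film", "negative"),
    ("never a dull film", "positive"),
]

original_accuracy = accuracy(original_set, keyword_sentiment)
contrast_accuracy = accuracy(contrast_set, keyword_sentiment)
print(original_accuracy, contrast_accuracy)
```

The gap between the two numbers is the point of the exercise: a simple decision rule can look perfect on the original instances and still fail once each instance is nudged across the decision boundary.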


“…Two of the most popular analysis techniques are the behavioral and probing approaches. In the behavioral approach, a model is evaluated on a set of examples carefully chosen to require competence in particular linguistic phenomena (Marvin and Linzen, 2018; Wang et al, 2018; Dasgupta et al, 2019; Poliak et al, 2018; Linzen et al, 2016; McCoy et al, 2019b; Warstadt et al, 2020). This technique can illuminate behavioral shortcomings but says little about how the internal representations are structured, treating the model as a black box.…”

Section: Analysis Of NNs (mentioning)

confidence: 99%
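The behavioral approach described in this statement can be sketched for minimal pairs: a model is treated as a black box that scores sentences, and it is credited with a pair when it assigns the acceptable sentence a higher probability than the unacceptable one. The code below is a minimal sketch under stated assumptions: the add-one-smoothed bigram scorer, the function names, the toy corpus, and the toy pairs are all hypothetical stand-ins, not the benchmark's data or any cited paper's model.

```python
import math
from collections import Counter


def make_bigram_scorer(corpus):
    """Build a toy add-one-smoothed bigram scorer (a stand-in for a real LM)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        contexts.update(tokens[:-1])            # tokens that precede another token
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    vocab_size = len(vocab) + 1                 # +1 reserves mass for unseen words

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.lower().split()
        return sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )

    return log_prob


def minimal_pair_accuracy(pairs, log_prob):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)


# Invented training sentences and agreement pairs, for illustration only.
toy_corpus = ["the cat sleeps", "the cats sleep", "the dog sleeps", "the dogs sleep"]
toy_pairs = [
    ("the cat sleeps", "the cat sleep"),
    ("the dogs sleep", "the dogs sleeps"),
]

pair_accuracy = minimal_pair_accuracy(toy_pairs, make_bigram_scorer(toy_corpus))
print(pair_accuracy)
```

Because only the comparison of two scores matters, any model that exposes a sentence log-probability could be dropped in place of the toy scorer; as the statement notes, this says nothing about how the model's internal representations are structured.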

Discovering the Compositional Structure of Vector Representations with Role Learning Networks

Soulos, McCoy, Linzen et al. 2020

Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP


How can neural networks perform so well on compositional tasks even though they lack explicit compositional representations? We use a novel analysis technique called ROLE to show that recurrent neural networks perform well on such tasks by converging to solutions which implicitly represent symbolic structure. This method uncovers a symbolic structure which, when properly embedded in vector space, closely approximates the encodings of a standard seq2seq network trained to perform the compositional SCAN task. We verify the causal importance of the discovered symbolic structure by showing that, when we systematically manipulate hidden embeddings based on this symbolic structure, the model's output is changed in the way predicted by our analysis.

[Figure: an RNN encoder and RNN decoder map SCAN commands to action sequences via an encoding, e.g. "jump and run twice" to JUMP RUN RUN and "jump and run left twice" to JUMP LTURN RUN LTURN RUN.]

Goal: Interpret neural network encodings.

Method: Approximate the encodings of a neural network with a more interpretable compositional model (§4).

Step 1: Assign structural roles to words using a learned role assigner.

Step 2: Combine word and role vectors using a closed-form equation with learned parameters.
