2020
DOI: 10.1162/tacl_a_00321
BLiMP: The Benchmark of Linguistic Minimal Pairs for English

Abstract: We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP), a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs—that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate a specific phenomenon in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggrega…
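The evaluation the abstract describes is a forced choice: for each minimal pair, an LM "passes" when it assigns higher total log-probability to the acceptable sentence than to the unacceptable one, and a dataset's score is the fraction of pairs passed. A minimal sketch of that protocol is below; the unigram scorer and its counts are hypothetical stand-ins for illustration, not the paper's actual models or data.

```python
import math
from typing import Callable, List, Tuple

def minimal_pair_accuracy(
    pairs: List[Tuple[str, str]],   # (acceptable_sentence, unacceptable_sentence)
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of pairs where the model assigns higher log-prob to the acceptable sentence."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)

# Toy unigram scorer: hypothetical counts chosen only to make the demo work.
# Any function mapping a sentence to a log-probability (e.g., from a neural LM)
# can be substituted for it.
COUNTS = {"the": 100, "cat": 30, "sleeps": 15, "sleep": 5}
TOTAL = sum(COUNTS.values())

def unigram_log_prob(sentence: str) -> float:
    # Sum of per-word log-probabilities; unseen words get a pseudo-count of 1.
    return sum(math.log(COUNTS.get(w, 1) / TOTAL) for w in sentence.lower().split())

pairs = [("The cat sleeps", "The cat sleep")]
print(minimal_pair_accuracy(pairs, unigram_log_prob))
```

A unigram model cannot actually capture agreement phenomena like this one; in practice the scorer would be a trained LM summing token log-probabilities over the whole sentence.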

Cited by 219 publications (299 citation statements)
References 37 publications
“…Given the differences in pre-training strategies, we speculate that pre-training with more data might benefit model robustness against noised data. This speculation is consistent with (Warstadt et al, 2019b), where the authors also give a lightweight demonstration on LSTM and Transformer-XL (Dai et al, 2019) with varying training data. We leave a further exploration of this speculation and a detailed analysis of model architecture to future work.…”
Section: How Grammatical Errors Affect Downstream Performance? (supporting)
confidence: 82%
“…In contrast, we propose a method to cover a broader range of grammatical errors and evaluate on downstream tasks. A concurrent work (Warstadt et al, 2019b) facilitates diagnosing language models by creating linguistic minimal pairs datasets for 67 isolated grammatical paradigms in English using linguist-crafted templates. In contrast, we do not rely heavily on artificial vocabulary and templates.…”
Section: Related Work (mentioning)
confidence: 99%
“…Challenge Sets The idea of creating challenging contrastive evaluation sets has a long history (Levesque et al, 2011; Ettinger et al, 2017; Glockner et al, 2018; Naik et al, 2018; Isabelle et al, 2017). Challenge sets exist for various phenomena, including ones with "minimal" edits similar to our contrast sets, e.g., in image captioning (Shekhar et al, 2017), machine translation (Sennrich, 2017; Burlot and Yvon, 2017; Burlot et al, 2018), and language modeling (Marvin and Linzen, 2018; Warstadt et al, 2019). Minimal pairs of edits that perturb gender or racial attributes are also useful for evaluating social biases (Zhao et al, 2018; Lu et al, 2018).…”
Section: Training On Perturbed Examples Many Previous Work Have Provided Minimally Contrastive (mentioning)
confidence: 99%
“…Two of the most popular analysis techniques are the behavioral and probing approaches. In the behavioral approach, a model is evaluated on a set of examples carefully chosen to require competence in particular linguistic phenomena (Marvin and Linzen, 2018; Wang et al, 2018; Dasgupta et al, 2019; Poliak et al, 2018; Linzen et al, 2016; McCoy et al, 2019b; Warstadt et al, 2020). This technique can illuminate behavioral shortcomings but says little about how the internal representations are structured, treating the model as a black box.…”
Section: Analysis Of NNs (mentioning)
confidence: 99%