Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
DOI: 10.18653/v1/2021.acl-long.90
When Do You Need Billions of Words of Pretraining Data?

Abstract: NLP is currently dominated by language models like RoBERTa which are pretrained on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? To explore this question, we adopt five styles of evaluation: classifier probing, information-theoretic probing, unsupervised relative acceptability judgments, unsupervised language model knowledge probing, and fine-tuning on NLU tasks. We then draw learning curves that track the growth of these measures of model ability with respect to pretraining data volume.
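
The learning-curve setup in the abstract is straightforward to approximate in code. Below is a minimal sketch of one of the five evaluations (classifier probing): fit a linear probe on frozen representations from each MiniBERTa checkpoint and record held-out accuracy as a function of pretraining volume. The checkpoint names follow the public MiniBERTa release on the HuggingFace Hub, but treat them, and the placeholder `(sentence, label)` pairs, as assumptions rather than the paper's exact pipeline.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# MiniBERTa checkpoints by pretraining volume (names as published on the
# HuggingFace Hub by nyu-mll; treat these identifiers as assumptions).
CHECKPOINTS = {
    "1M":   "nyu-mll/roberta-med-small-1M-1",
    "10M":  "nyu-mll/roberta-base-10M-1",
    "100M": "nyu-mll/roberta-base-100M-1",
    "1B":   "nyu-mll/roberta-base-1B-1",
}

def embed(name, sentences):
    """Frozen sentence representations: the <s>-position hidden vector."""
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    with torch.no_grad():
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        return model(**batch).last_hidden_state[:, 0].numpy()

def probe_accuracy(name, train_pairs, test_pairs):
    """Fit a linear probe on frozen features; return held-out accuracy."""
    X_tr = embed(name, [s for s, _ in train_pairs])
    X_te = embed(name, [s for s, _ in test_pairs])
    clf = LogisticRegression(max_iter=1000).fit(X_tr, [y for _, y in train_pairs])
    return clf.score(X_te, [y for _, y in test_pairs])

# Learning curve: probe accuracy as a function of pretraining volume,
# given (sentence, label) pairs for some linguistic feature of interest.
# curve = {vol: probe_accuracy(ckpt, train_pairs, test_pairs)
#          for vol, ckpt in CHECKPOINTS.items()}
```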

Cited by 52 publications (52 citation statements) · References 38 publications
“…Even if part of the probing performance is due to the classifier itself, we claim that the difference in probing performance will be due to the difference in the amount of linguistic knowledge encoded in the representations we manipulate. This conjecture is strengthened by the findings of Zhang et al. (2020), who analysed the representations from pretrained miniBERTas and demonstrated that the trends found through edge probing (Tenney et al., 2019b) are the same as those found through better-designed probes such as Minimum Description Length (Voita & Titov, 2020). Therefore, in our work we adopt edge probing and structural probing for contextualized embeddings.…”
Section: Measuring the Amount of Linguistic Knowledge (supporting)
confidence: 56%
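
For context, the Minimum Description Length probe mentioned above replaces raw probing accuracy with a codelength: the "online coding" variant trains the probe on growing prefixes of the data and charges the model its cross-entropy on each unseen block. A minimal sketch, assuming features `X`, integer labels `y`, and that the first block already contains every class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def online_codelength(X, y, n_classes, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """MDL online coding: bits needed to transmit the labels given X.
    The first block is sent with a uniform code; each later block is
    charged the probe's cross-entropy, trained only on preceding data."""
    cuts = [int(f * len(y)) for f in fractions]
    bits = cuts[0] * np.log2(n_classes)            # uniform code for block 1
    for t0, t1 in zip(cuts[:-1], cuts[1:]):
        clf = LogisticRegression(max_iter=1000).fit(X[:t0], y[:t0])
        nll = log_loss(y[t0:t1], clf.predict_proba(X[t0:t1]),
                       labels=clf.classes_)        # mean NLL in nats
        bits += nll * (t1 - t0) / np.log(2)        # convert to total bits
    return bits  # lower codelength = more easily extractable structure
```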
“…This finding agrees with Warstadt et al. (2020b), who found that larger LMs have an inductive bias towards linguistic generalizations, while smaller LMs have an inductive bias towards surface generalizations; this may explain the success of large LMs on downstream tasks. A small quantity of data (10M tokens) is sufficient for LMs to prefer the constructional sort, indicating that ASCs are relatively easy to learn: roughly on par with other types of linguistic knowledge, and requiring less data than commonsense knowledge (Zhang et al., 2021; Liu et al., 2021).…”
Section: Results and Interpretation (mentioning)
confidence: 99%
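
The "preference" measured in statements like this one is typically an unsupervised relative acceptability judgment: the LM should score a minimal pair's grammatical (or constructional) variant higher than the alternative. A minimal sketch using masked-LM pseudo-log-likelihood, with `roberta-base` as a stand-in checkpoint (any RoBERTa-family model, including a MiniBERTa, would slot in):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

NAME = "roberta-base"  # stand-in checkpoint, not the paper's exact model
tok = AutoTokenizer.from_pretrained(NAME)
lm = AutoModelForMaskedLM.from_pretrained(NAME).eval()

def pseudo_log_likelihood(sentence):
    """Sum of each token's log-prob when it alone is masked."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):               # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = lm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, -1)[ids[i]].item()
    return total

good, bad = "The cats sleep.", "The cats sleeps."
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))  # expect True
```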
“…Scaling laws (Kaplan et al., 2020) suggest that optimal compute-efficient training involves training very large models on a relatively modest amount of data. Indeed, many studies explore the relationship between the volume of training data and the resulting model performance (Banko and Brill, 2001; Sun et al., 2017; van Schijndel et al., 2019; Hu et al., 2020; Raffel et al., 2020; Brown et al., 2020; Zhang et al., 2021), generally concluding that performance improves rapidly as the amount of training data increases, at least up to a certain point, after which the improvements slow down.…”
Section: Related Work (mentioning)
confidence: 99%
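
The "rapid improvement, then slowdown" shape is what a power law predicts. The data-scaling form from Kaplan et al. (2020), L(D) ≈ (D_c / D)^α, is a straight line in log-log space, so it can be fit with a one-line regression. The (volume, loss) points below are invented purely to keep the snippet runnable:

```python
import numpy as np

# Illustrative (pretraining words, validation loss) points -- made up.
D = np.array([1e6, 1e7, 1e8, 1e9])
L = np.array([6.2, 4.9, 4.0, 3.4])

# log L = -alpha * log D + alpha * log D_c  =>  a line in log-log space.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
alpha = -slope                       # scaling exponent
D_c = np.exp(intercept / alpha)      # critical data-size constant
print(f"alpha ≈ {alpha:.3g}, D_c ≈ {D_c:.3g} words")
```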