Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.553

Pretrained Language Model Embryology: The Birth of ALBERT

Abstract: While behaviors of pretrained language models (LMs) have been thoroughly examined, what happened during pretraining is rarely studied. We thus investigate the developmental process from a set of randomly initialized parameters to a totipotent language model, which we refer to as the embryology of a pretrained language model. Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) at different rates during pretraining. We also find that linguistic kno…
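The probing setup the abstract describes (checking how well a checkpoint recovers masked tokens of each part of speech) can be sketched roughly as below. This is an illustrative sketch, not the authors' code: the checkpoint name `albert-base-v2`, the toy sentences, and the hand-assigned POS labels are assumptions; the paper instead evaluates its own intermediate ALBERT pretraining checkpoints on tagged corpora.

```python
# Illustrative sketch (not the authors' code): probe how well a masked-LM
# checkpoint recovers masked tokens of different parts of speech.
# The checkpoint name, toy sentences, and hand-assigned POS tags below are
# assumptions for demonstration only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "albert-base-v2"  # stand-in for an intermediate pretraining checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
model.eval()

# (sentence, target word, coarse POS tag) -- hand-labelled toy examples
examples = [
    ("The cat sat on the mat.", "cat", "NOUN"),
    ("She quickly closed the door.", "closed", "VERB"),
    ("The sky looked very dark.", "dark", "ADJ"),
]

hits, totals = {}, {}
for sentence, word, pos in examples:
    masked = sentence.replace(word, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # position of the [MASK] token in the input sequence
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    predicted_id = logits[0, mask_index].argmax().item()
    predicted = tokenizer.decode([predicted_id]).strip()
    totals[pos] = totals.get(pos, 0) + 1
    hits[pos] = hits.get(pos, 0) + int(predicted.lower() == word.lower())

for pos, total in totals.items():
    print(f"{pos}: {hits[pos]}/{total} masked tokens recovered")
```

Running such a probe over a series of saved checkpoints, rather than a single released model, would approximate the "embryology" view of how recovery accuracy for each POS class evolves over pretraining.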

Cited by 18 publications (17 citation statements)
References 15 publications
“…Hu et al. (2020) find that GPT-2 trained on 42M words performs roughly as well on a syntax benchmark as a similar model trained on 100 times that amount. Other studies have investigated how one model's linguistic knowledge changes during the training process, as a function of the number of updates (Saphra and Lopez, 2019; Chiang et al., 2020). Raffel et al. (2020) also investigate how performance on SuperGLUE (and other downstream tasks) improves with pretraining dataset size between about 8M and 34B tokens.…”
Section: Related Work
confidence: 99%
“…Many researchers have sought to interpret what kinds of knowledge are acquired during this "pretraining" phase (Clark et al., 2019; Hao et al., 2019; Kovaleva et al., 2019; Belinkov et al., 2020). Extending Chiang et al. (2020), we systematically conduct probing across the pretraining iterations, to understand not just what is learned (as explored in numerous past analyses of fixed, already-trained models), but also when.…”
Section: Introduction
confidence: 99%
“…As fixed artifacts, they have become the object of intense study, with many researchers "probing" the extent to which they acquire and readily demonstrate linguistic abstractions, factual and commonsense knowledge, and reasoning abilities. Recent work applied several probes to intermediate training stages to observe the developmental process of a large-scale model (Chiang et al., 2020). Following this effort, we systematically answer a question: for various types of knowledge a language model learns, when during (pre)training are they acquired?…”
confidence: 99%
“…However, many questions remain on how these models work and what they know about language. Previous research focuses on what knowledge has been learned during and after the pre-training phase (Chiang et al., 2020; Rogers et al., 2020a), and how it is affected by fine-tuning (Gauthier and Levy, 2019; Peters et al., 2019; Miaschi et al., 2020; Merchant et al., 2020). In addition, a wide variety of language phenomena has been investigated, including syntax (Hewitt and Manning, 2019a; Liu et al., 2019a), world knowledge (Petroni et al., 2019; Jiang et al., 2020), reasoning (van Aken et al., 2019), commonsense understanding (Klein and Nabi, 2019), and semantics (Ettinger, 2020).…”
Section: Introduction
confidence: 99%