Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
DOI: 10.18653/v1/2021.acl-long.90
When Do You Need Billions of Words of Pretraining Data?

Abstract: NLP is currently dominated by language models like RoBERTa which are pretrained on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? To explore this question, we adopt five styles of evaluation: classifier probing, information-theoretic probing, unsupervised relative acceptability judgments, unsupervised language model knowledge probing, and fine-tuning on NLU tasks. We then draw learning curves that track the growth of these measures of model ability with respect to pretraining data volume.
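
The learning-curve setup in the abstract is straightforward to approximate in code. Below is a minimal sketch of one of the five evaluations (classifier probing): fit a linear probe on frozen representations from each MiniBERTa checkpoint and record held-out accuracy as a function of pretraining volume. The checkpoint names follow the public MiniBERTa release on the HuggingFace Hub, but treat them, and the placeholder `(sentence, label)` pairs, as assumptions rather than the paper's exact pipeline.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# MiniBERTa checkpoints by pretraining volume (names as published on the
# HuggingFace Hub by nyu-mll; treat these identifiers as assumptions).
CHECKPOINTS = {
    "1M":   "nyu-mll/roberta-med-small-1M-1",
    "10M":  "nyu-mll/roberta-base-10M-1",
    "100M": "nyu-mll/roberta-base-100M-1",
    "1B":   "nyu-mll/roberta-base-1B-1",
}

def embed(name, sentences):
    """Frozen sentence representations: the <s>-position hidden vector."""
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    with torch.no_grad():
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        return model(**batch).last_hidden_state[:, 0].numpy()

def probe_accuracy(name, train_pairs, test_pairs):
    """Fit a linear probe on frozen features; return held-out accuracy."""
    X_tr = embed(name, [s for s, _ in train_pairs])
    X_te = embed(name, [s for s, _ in test_pairs])
    clf = LogisticRegression(max_iter=1000).fit(X_tr, [y for _, y in train_pairs])
    return clf.score(X_te, [y for _, y in test_pairs])

# Learning curve: probe accuracy as a function of pretraining volume,
# given (sentence, label) pairs for some linguistic feature of interest.
# curve = {vol: probe_accuracy(ckpt, train_pairs, test_pairs)
#          for vol, ckpt in CHECKPOINTS.items()}
```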

Cited by 52 publications (52 citation statements) · References 38 publications
“…Even if part of the probing performance is due to the classifier itself, we claim that the difference in probing performance will be due to the difference in the amount of linguistic knowledge encoded in the representations we manipulate. This conjecture is strengthened by the findings of Zhang et al. (2020), who analysed the representations from pretrained miniBERTas and demonstrated that the trends found through edge probing (Tenney et al., 2019b) are the same as those found through better-designed probes such as Minimum Description Length (Voita & Titov, 2020). Therefore, in our work we adopt edge probing and structural probing for contextualized embeddings.…”
Section: Measuring the Amount of Linguistic Knowledge (supporting)
confidence: 56%
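
For context, the Minimum Description Length probe mentioned above replaces raw probing accuracy with a codelength: the "online coding" variant trains the probe on growing prefixes of the data and charges the model its cross-entropy on each unseen block. A minimal sketch, assuming features `X`, integer labels `y`, and that the first block already contains every class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def online_codelength(X, y, n_classes, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """MDL online coding: bits needed to transmit the labels given X.
    The first block is sent with a uniform code; each later block is
    charged the probe's cross-entropy, trained only on preceding data."""
    cuts = [int(f * len(y)) for f in fractions]
    bits = cuts[0] * np.log2(n_classes)            # uniform code for block 1
    for t0, t1 in zip(cuts[:-1], cuts[1:]):
        clf = LogisticRegression(max_iter=1000).fit(X[:t0], y[:t0])
        nll = log_loss(y[t0:t1], clf.predict_proba(X[t0:t1]),
                       labels=clf.classes_)        # mean NLL in nats
        bits += nll * (t1 - t0) / np.log(2)        # convert to total bits
    return bits  # lower codelength = more easily extractable structure
```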
“…This finding agrees with Warstadt et al. (2020b), who found that larger LMs have an inductive bias towards linguistic generalizations, while smaller LMs have an inductive bias towards surface generalizations; this may explain the success of large LMs on downstream tasks. A small quantity of data (10M tokens) is sufficient for LMs to prefer the constructional sort, indicating that ASCs are relatively easy to learn: roughly on par with other types of linguistic knowledge, and requiring less data than commonsense knowledge (Zhang et al., 2021; Liu et al., 2021).…”
Section: Results and Interpretation (mentioning)
confidence: 99%
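
The "preference" measured in statements like this one is typically an unsupervised relative acceptability judgment: the LM should score a minimal pair's grammatical (or constructional) variant higher than the alternative. A minimal sketch using masked-LM pseudo-log-likelihood, with `roberta-base` as a stand-in checkpoint (any RoBERTa-family model, including a MiniBERTa, would slot in):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

NAME = "roberta-base"  # stand-in checkpoint, not the paper's exact model
tok = AutoTokenizer.from_pretrained(NAME)
lm = AutoModelForMaskedLM.from_pretrained(NAME).eval()

def pseudo_log_likelihood(sentence):
    """Sum of each token's log-prob when it alone is masked."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):               # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = lm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, -1)[ids[i]].item()
    return total

good, bad = "The cats sleep.", "The cats sleeps."
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))  # expect True
```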
“…Scaling laws (Kaplan et al., 2020) suggest that optimal compute-efficient training involves training very large models on a relatively modest amount of data. Indeed, many studies explore the relationship between the volume of training data and the resulting model performance (Banko and Brill, 2001; Sun et al., 2017; van Schijndel et al., 2019; Hu et al., 2020; Raffel et al., 2020; Brown et al., 2020; Zhang et al., 2021), generally concluding that performance improves rapidly as the amount of training data increases, at least up to a certain point, after which the improvements slow down.…”
Section: Related Work (mentioning)
confidence: 99%
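
The "rapid improvement, then slowdown" shape is what a power law predicts. The data-scaling form from Kaplan et al. (2020), L(D) ≈ (D_c / D)^α, is a straight line in log-log space, so it can be fit with a one-line regression. The (volume, loss) points below are invented purely to keep the snippet runnable:

```python
import numpy as np

# Illustrative (pretraining words, validation loss) points -- made up.
D = np.array([1e6, 1e7, 1e8, 1e9])
L = np.array([6.2, 4.9, 4.0, 3.4])

# log L = -alpha * log D + alpha * log D_c  =>  a line in log-log space.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
alpha = -slope                       # scaling exponent
D_c = np.exp(intercept / alpha)      # critical data-size constant
print(f"alpha ≈ {alpha:.3g}, D_c ≈ {D_c:.3g} words")
```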