2020
DOI: 10.1162/tacl_a_00349

A Primer in BERTology: What We Know About How BERT Works

Abstract: Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression. We then outline directions for future research.

Cited by 980 publications (750 citation statements)
References 121 publications
“…With the established impressive performance of large pre-trained language models (Devlin et al., 2019; Liu et al., 2019b), based on the Transformer architecture (Vaswani et al., 2017), a large body of work started studying and gaining insight into how these models work and what they encode. For a thorough summary of these advancements we refer the reader to a recent primer on the subject (Rogers et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
“…This model offers enhanced parallelization and better modeling of long-range dependencies in text and, as such, has achieved state-of-the-art performance on a variety of tasks in NLP. Previous research (Jawahar et al., 2019; Rogers et al., 2021) has suggested that it encodes language information (lexical, syntactic, etc.) that is known to be important for performing complex natural language tasks, including AD detection from speech.…”
Section: Transfer Learning-based Approach (mentioning)
confidence: 99%
“…For a complete overview of existing probe and analysis methods, the survey of Belinkov and Glass (2019) provides a synthesis of analysis studies on neural network methods. The more recent survey of Rogers, Kovaleva, and Rumshisky (2020) is a similar synthesis but targeted at BERT and its derivatives.…”
Section: Analysis of Pretrained Language Models (mentioning)
confidence: 99%