Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.389

What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models

Abstract: Peeking into the inner workings of BERT has shown that its layers resemble the classical NLP pipeline, with progressively more complex tasks being concentrated in later layers. To investigate to what extent these results also hold for a language other than English, we probe a Dutch BERT-based model and the multilingual BERT model for Dutch NLP tasks. In addition, through a deeper analysis of part-of-speech tagging, we show that also within a given task, information is spread over different parts of the network a…
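The layer-wise probing described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's exact setup: the model name, the mean-pooling step, and the scikit-learn logistic-regression probe are assumptions chosen for brevity.

```python
# Minimal layer-wise probing sketch (not the authors' exact method): extract
# per-layer representations from a BERT-style model and fit one simple linear
# probe per layer, so later layers can be compared against earlier ones.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any BERT-style checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def layer_features(sentences):
    """Return one (n_sentences, hidden_size) array per layer (embeddings + 12 layers)."""
    with torch.no_grad():
        enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        out = model(**enc)
    # Crude mean-pooling over tokens (padding included); fine for a sketch.
    return [h.mean(dim=1).numpy() for h in out.hidden_states]

def probe_per_layer(train_sents, train_labels, test_sents, test_labels):
    """Fit a logistic-regression probe on each layer and report its test accuracy."""
    train_feats = layer_features(train_sents)
    test_feats = layer_features(test_sents)
    scores = []
    for layer, (Xtr, Xte) in enumerate(zip(train_feats, test_feats)):
        clf = LogisticRegression(max_iter=1000).fit(Xtr, train_labels)
        scores.append((layer, clf.score(Xte, test_labels)))
    return scores
```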


Cited by 34 publications (20 citation statements)
References 8 publications
“…Several model probing works have revealed that the scalar mixing method introduced by Peters et al. (2018a) allows for combining information from all layers with improved performance on lexico-semantic tasks (Liu et al. 2019a; de Vries, van Cranenburgh, and Nissim 2020). However, scalar mixing essentially involves training a learned probe, which can limit attempts at analysing the inherent semantic space represented by NLMs (Mickus et al. 2020).…”
Section: Results (citation type: mentioning)
confidence: 99%
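For reference, the scalar mixing method of Peters et al. (2018a) referred to in this statement combines all layers through a learned softmax weight per layer plus a global scale. The sketch below is a generic re-implementation of that formulation, not code from any of the cited papers; because the mixing weights are trained jointly with a downstream probe, the caveat about learned probes applies to it directly.

```python
# A minimal sketch of ELMo-style scalar mixing (Peters et al., 2018a) as commonly
# reused for BERT probing: softmax-normalised layer weights and a global scale.
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one logit per layer
        self.gamma = nn.Parameter(torch.ones(1))               # global scaling factor

    def forward(self, hidden_states):
        # hidden_states: list/tuple of L tensors, each of shape (batch, seq, hidden)
        norm_weights = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(hidden_states, dim=0)            # (L, batch, seq, hidden)
        mixed = (norm_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return self.gamma * mixed
```

Because `weights` and `gamma` are optimised together with whatever classifier sits on top, the mix reflects what the probe finds useful, not necessarily what the frozen model itself encodes.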
“…In this work, we carry out an analysis on three popular language models with totally different pretraining objectives: BERT (masked language modeling), XLNet (permuted language modeling), and ELECTRA (replaced token detection, Clark et al., 2020). We also show that the "weight mixing" evaluation strategy of Tenney et al. (2019a), which is widely used in the context of probing (de Vries et al., 2020; Kuznetsov and Gurevych, 2020; Choenni and Shutova, 2020, inter alia), might not be a reliable basis for drawing conclusions in the layer-wise cross-model analysis as it does not take into account the norm disparity across the representations of different layers. Instead, we perform an information-theoretic probing analysis using Minimum Description Length proposed by Voita and Titov (2020).…”
Section: Introduction (citation type: mentioning)
confidence: 95%
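The norm-disparity point can be made concrete with a small, hedged illustration (a heuristic of this summary, not the MDL analysis of Voita and Titov, 2020): if some layers produce vectors with much larger norms, the softmax mixing weights alone misstate how much each layer actually contributes to the mixed representation.

```python
# Hedged illustration of the norm-disparity caveat: scale each mixing weight by
# the average representation norm of its layer to get a rough "effective" share.
import torch

def per_layer_norms(hidden_states):
    """Average L2 norm of token vectors in each layer (list of (batch, seq, hidden) tensors)."""
    return [h.norm(dim=-1).mean().item() for h in hidden_states]

def effective_contributions(mix_weights, hidden_states):
    """Softmax weight times average layer norm, renormalised to sum to one."""
    w = torch.softmax(mix_weights, dim=0)            # learned logits, one per layer
    norms = torch.tensor(per_layer_norms(hidden_states))
    contrib = w * norms
    return contrib / contrib.sum()
```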
“…BERT is currently one of the most effective language models in terms of performance when different NLP tasks like text classification are concerned. Previous research has shown how BERT captures the language context in an efficient way [17], [18], [19].…”
Section: Related Work (citation type: mentioning)
confidence: 99%