Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.389

What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models

Abstract: Peeking into the inner workings of BERT has shown that its layers resemble the classical NLP pipeline, with progressively more complex tasks being concentrated in later layers. To investigate to what extent these results also hold for a language other than English, we probe a Dutch BERT-based model and the multilingual BERT model for Dutch NLP tasks. In addition, through a deeper analysis of part-of-speech tagging, we show that also within a given task, information is spread over different parts of the network a…
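The layer-wise probing described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's exact setup: the model name, the mean-pooling step, and the scikit-learn logistic-regression probe are assumptions chosen for brevity.

```python
# Minimal layer-wise probing sketch (not the authors' exact method): extract
# per-layer representations from a BERT-style model and fit one simple linear
# probe per layer, so later layers can be compared against earlier ones.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any BERT-style checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def layer_features(sentences):
    """Return one (n_sentences, hidden_size) array per layer (embeddings + 12 layers)."""
    with torch.no_grad():
        enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        out = model(**enc)
    # Crude mean-pooling over tokens (padding included); fine for a sketch.
    return [h.mean(dim=1).numpy() for h in out.hidden_states]

def probe_per_layer(train_sents, train_labels, test_sents, test_labels):
    """Fit a logistic-regression probe on each layer and report its test accuracy."""
    train_feats = layer_features(train_sents)
    test_feats = layer_features(test_sents)
    scores = []
    for layer, (Xtr, Xte) in enumerate(zip(train_feats, test_feats)):
        clf = LogisticRegression(max_iter=1000).fit(Xtr, train_labels)
        scores.append((layer, clf.score(Xte, test_labels)))
    return scores
```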


Cited by 34 publications (20 citation statements)
References 8 publications
“…Several model probing works have revealed that the scalar mixing method introduced by Peters et al. (2018a) allows for combining information from all layers with improved performance on lexico-semantic tasks (Liu et al. 2019a; de Vries, van Cranenburgh, and Nissim 2020). However, scalar mixing essentially involves training a learned probe, which can limit attempts at analysing the inherent semantic space represented by NLMs (Mickus et al. 2020).…”
Section: Results (citation type: mentioning)
confidence: 99%
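For reference, the scalar mixing method of Peters et al. (2018a) referred to in this statement combines all layers through a learned softmax weight per layer plus a global scale. The sketch below is a generic re-implementation of that formulation, not code from any of the cited papers; because the mixing weights are trained jointly with a downstream probe, the caveat about learned probes applies to it directly.

```python
# A minimal sketch of ELMo-style scalar mixing (Peters et al., 2018a) as commonly
# reused for BERT probing: softmax-normalised layer weights and a global scale.
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one logit per layer
        self.gamma = nn.Parameter(torch.ones(1))               # global scaling factor

    def forward(self, hidden_states):
        # hidden_states: list/tuple of L tensors, each of shape (batch, seq, hidden)
        norm_weights = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(hidden_states, dim=0)            # (L, batch, seq, hidden)
        mixed = (norm_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return self.gamma * mixed
```

Because `weights` and `gamma` are optimised together with whatever classifier sits on top, the mix reflects what the probe finds useful, not necessarily what the frozen model itself encodes.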
“…In this work, we carry out an analysis on three popular language models with totally different pretraining objectives: BERT (masked language modeling), XLNet (permuted language modeling), and ELECTRA (replaced token detection, Clark et al., 2020). We also show that the "weight mixing" evaluation strategy of Tenney et al. (2019a), which is widely used in the context of probing (de Vries et al., 2020; Kuznetsov and Gurevych, 2020; Choenni and Shutova, 2020, inter alia), might not be a reliable basis for drawing conclusions in the layer-wise cross-model analysis as it does not take into account the norm disparity across the representations of different layers. Instead, we perform an information-theoretic probing analysis using Minimum Description Length proposed by Voita and Titov (2020).…”
Section: Introduction (citation type: mentioning)
confidence: 95%
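The norm-disparity point can be made concrete with a small, hedged illustration (a heuristic of this summary, not the MDL analysis of Voita and Titov, 2020): if some layers produce vectors with much larger norms, the softmax mixing weights alone misstate how much each layer actually contributes to the mixed representation.

```python
# Hedged illustration of the norm-disparity caveat: scale each mixing weight by
# the average representation norm of its layer to get a rough "effective" share.
import torch

def per_layer_norms(hidden_states):
    """Average L2 norm of token vectors in each layer (list of (batch, seq, hidden) tensors)."""
    return [h.norm(dim=-1).mean().item() for h in hidden_states]

def effective_contributions(mix_weights, hidden_states):
    """Softmax weight times average layer norm, renormalised to sum to one."""
    w = torch.softmax(mix_weights, dim=0)            # learned logits, one per layer
    norms = torch.tensor(per_layer_norms(hidden_states))
    contrib = w * norms
    return contrib / contrib.sum()
```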
“…BERT is currently one of the most effective language models in terms of performance when different NLP tasks like text classification are concerned. Previous research has shown how BERT captures the language context in an efficient way [17], [18], [19].…”
Section: Related Work (citation type: mentioning)
confidence: 99%