2020
DOI: 10.1162/tacl_a_00349

A Primer in BERTology: What We Know About How BERT Works

Abstract: Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression. We then outline directions for future research.

Cited by 980 publications (750 citation statements)
References 121 publications
“…With the established impressive performance of large pre-trained language models (Devlin et al., 2019; Liu et al., 2019b), based on the Transformer architecture (Vaswani et al., 2017), a large body of work started studying and gaining insight into how these models work and what they encode. For a thorough summary of these advancements we refer the reader to a recent primer on the subject (Rogers et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
“…This model offers enhanced parallelization and better modeling of long-range dependencies in text and, as such, has achieved state-of-the-art performance on a variety of tasks in NLP. Previous research (Jawahar et al., 2019; Rogers et al., 2021) has suggested that it encodes language information (lexical, syntactic, etc.) that is known to be important for performing complex natural language tasks, including AD detection from speech.…”
Section: Transfer Learning-based Approach (mentioning)
confidence: 99%
“…For a complete overview of existing probe and analysis methods, the survey of Belinkov and Glass (2019) provides a synthesis of analysis studies on neural network methods. The more recent survey of Rogers, Kovaleva, and Rumshisky (2020) is a similar synthesis but targeted at BERT and its derivatives.…”
Section: Analysis of Pretrained Language Models (mentioning)
confidence: 99%