How is BERT surprised? Layerwise detection of linguistic anomalies

Li, Bai; Zhu, Zining; Thomas, Guillaume; Xu, Yang; Rudzicz, Frank

doi:10.18653/v1/2021.acl-long.325

Cited by 12 publications

(12 citation statements)

References 34 publications


“…Another line of probing work designs control tasks (Ravichander et al, 2021; Lan et al, 2020) to reverse-engineer the internal mechanisms of representations (Kovaleva et al, 2019; …). However, in contrast to our work, most studies (Zhong et al, 2021; Li et al, 2021; …) focused on pre-trained representations, not fine-tuned ones.…”

Section: Related Work (mentioning)

confidence: 84%

A Closer Look at How Fine-tuning Changes BERT

Zhou, Srikumar

2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


Given the prevalence of pre-trained contextualized representations in today's NLP, there have been many efforts to understand what information they contain, and why they seem to be universally successful. The most common approach to use these representations involves fine-tuning them for an end task. Yet, how fine-tuning changes the underlying embedding space is less studied. In this work, we study the English BERT family and use two probing techniques to analyze how fine-tuning changes the space. We hypothesize that fine-tuning affects classification performance by increasing the distances between examples associated with different labels. We confirm this hypothesis with carefully designed experiments on five different NLP tasks. Via these experiments, we also discover an exception to the prevailing wisdom that "fine-tuning always improves performance". Finally, by comparing the representations before and after fine-tuning, we discover that fine-tuning does not introduce arbitrary changes to representations; instead, it adjusts the representations to downstream tasks while largely preserving the original spatial structure of the data points.


“…To verify this possibility, we randomly sampled 1,000 sentences that contained "only N" and ever, respectively, from the Corpus of Contemporary American English (Davies, 2008-) and found their conditional probabilities are more or less balanced, e.g., P(ever|only) = 2.8% and P(only|ever) = 2.8%. In addition, Li et al (2021) recently showed by a layerwise model analysis that the effect of frequency information is strong only in the lower layers of Transformer language models like BERT but eventually decreases in the upper layers. Thus, we exclude the possibility that the unequal results for only in the two settings are simply an artifact of word frequencies.…”

Section: Results (mentioning)

confidence: 99%

Investigating a neural language model’s replicability of psycholinguistic experiments: A case study of NPI licensing

Shin, Song

2023

Front. Psychol.


The recent success of deep learning neural language models such as Bidirectional Encoder Representations from Transformers (BERT) has brought innovations to computational language research. The present study explores the possibility of using a language model in investigating human language processes, based on the case study of negative polarity items (NPIs). We first conducted an experiment with BERT to examine whether the model successfully captures the hierarchical structural relationship between an NPI and its licensor and whether it may lead to an error analogous to the grammatical illusion shown in the psycholinguistic experiment (Experiment 1). We also investigated whether the language model can capture the fine-grained semantic properties of NPI licensors and discriminate their subtle differences on the scale of licensing strengths (Experiment 2). The results of the two experiments suggest that overall, the neural language model is highly sensitive to both syntactic and semantic constraints in NPI processing. The model’s processing patterns and sensitivities are shown to be very close to humans, suggesting their role as a research tool or object in the study of language.


“…Note that there are also many probing papers without post-hoc classifiers (Zhou and Srikumar, 2021; Torroba Hennigen et al, 2020; Li et al, 2021). While many of these do not mention the term "probing", they nevertheless probe the intrinsics of deep neural models.…”

Section: Probing Methods (mentioning)

confidence: 99%

On the data requirements of probing

Zhu, Wang, Li et al.

2022

Preprint

Self Cite


As large and powerful neural language models are developed, researchers have been increasingly interested in developing diagnostic tools to probe them. There are many papers with conclusions of the form "observation X is found in model Y ", using their own datasets with varying sizes. Larger probing datasets bring more reliability, but are also expensive to collect. There is yet to be a quantitative method for estimating reasonable probing dataset sizes. We tackle this omission in the context of comparing two probing configurations: after we have collected a small dataset from a pilot study, how many additional data samples are sufficient to distinguish two different configurations? We present a novel method to estimate the required number of data samples in such experiments and, across several case studies, we verify that our estimations have sufficient statistical power. Our framework helps to systematically construct probing datasets to diagnose neural NLP models.
