Open Sesame: Getting Inside BERT's Linguistic Knowledge

Lin, Yongjie; Tan, Yi Chern; Frank, Robert

doi:10.48550/arxiv.1906.01698

Cited by 27 publications

(32 citation statements)

References 19 publications

“…experimentally investigated the power of self-attention to extract word order information, finding differences between recurrent and self-attention models; however, these were modulated by the training objective. Lin et al (2019) and Tenney et al (2019) show that BERT (Devlin et al, 2018) encodes syntactic information.…”

Section: Related Work

mentioning

confidence: 99%

“…Consequently, many researchers have studied the capability of recurrent neural network models to capture context-free languages (e.g., Kalinke and Lehmann (1998); Gers and Schmidhuber (2001); Grüning (2006); Weiss et al (2018); Sennhauser and Berwick (2018)) and linguistic phenomena involving hierarchical structure (e.g., Linzen et al (2016); Gulordava et al (2018)). Some experimental evidence suggests that transformers might not be as strong as LSTMs at modeling hierarchical structure (Tran et al, 2018), though analysis studies have shown that transformer-based models encode a good amount of syntactic knowledge (e.g., Clark et al (2019); Lin et al (2019); Tenney et al (2019)).…”

mentioning

confidence: 99%

Theoretical Limitations of Self-Attention in Neural Sequence Models

Hahn

2019

Preprint

Transformers are emerging as the new workhorse of NLP, showing great success across tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention. Previous work has suggested that the computational capabilities of self-attention to process hierarchical structures are limited. In this work, we mathematically investigate the computational power of self-attention to model formal languages. Across both soft and hard attention, we show strong theoretical limitations of the computational abilities of self-attention, finding that it cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length. Our results precisely describe theoretical limitations of the techniques underlying recent advances in NLP.

“…the contextualized representations that these LM compute, revealing that they encode substantial amounts of syntax and semantics (Linzen et al, 2016b; Peters et al, 2018b; Tenney et al, 2019b; Goldberg, 2019; Hewitt and Manning, 2019; Tenney et al, 2019a; Lin et al, 2019; Coenen et al, 2019).…”

Section: Introduction

mentioning

confidence: 99%

oLMpics -- On what Language Model Pre-training Captures

Talmor, Elazar, Goldberg et al.

2019

Preprint

Recent success of pre-trained language models (LMs) has spurred widespread interest in the language capabilities that they possess. However, efforts to understand whether LM representations are useful for symbolic reasoning tasks have been limited and scattered. In this work, we propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data. To address this, we propose an evaluation protocol that includes both zero-shot evaluation (no fine-tuning), as well as comparing the learning curve of a fine-tuned LM to the learning curve of multiple controls, which paints a rich picture of the LM capabilities. Our main findings are that: (a) different LMs exhibit qualitatively different reasoning abilities, e.g., ROBERTA succeeds in reasoning tasks where BERT fails completely; (b) LMs do not reason in an abstract manner and are context-dependent, e.g., while ROBERTA can compare ages, it can do so only when the ages are in the typical range of human ages; (c) On half of our reasoning tasks all models fail completely. Our findings and infrastructure can help future work on designing new datasets, models and objective functions for pre-training.

“…To discuss the contribution of the short-term properties to the representative capability for NLP tasks, we also measure the performances in the short-term range with the following three tasks: MLM task, semantic textual similarity benchmark (STS-B) [17], and handwriting task (see the Appendix for the detailed setups). These layerwise analyses are similar to those in [18] which evaluates BERT performance, and our study inspects the properties for wider time range. In parallel, we investigate the system's global properties in the long term analysis.…”

Section: A Albert As "The Reservoir"

mentioning

confidence: 77%

Transient Chaos in BERT

Inoue, Ohara, Kuniyoshi et al.

2021

Preprint

Language is an outcome of our complex and dynamic human-interactions and the technique of natural language processing (NLP) is hence built on human linguistic activities. Along with Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT) has recently gained its popularity, owing to its outstanding NLP capability, by establishing the state-of-the-art scores in several NLP benchmarks. A Lite BERT (ALBERT) is literally characterized as a lightweight version of BERT, in which the number of BERT parameters is reduced by repeatedly applying the same neural network called Transformer's encoder layer. By pre-training the parameters with a massive amount of natural language data, ALBERT can convert input sentences into versatile high-dimensional vectors potentially capable of solving multiple NLP tasks. In that sense, ALBERT can be regarded as a well-designed high-dimensional dynamical system whose operator is the Transformer's encoder, and essential structures of human language are thus expected to be encapsulated in its dynamics. In this study, we investigated the embedded properties of ALBERT to reveal how NLP tasks are effectively solved by exploiting its dynamics. We thereby aimed to explore the nature of human language from the dynamical expressions of the NLP model. Our analysis consists of two parts, namely short- and long-term analyses, according to time-scale differences to capture the dynamics. Our short-term analysis clarified that the pre-trained model stably yields trajectories with higher dimensionality in a certain time range, which would enhance the expressive capacity required for NLP tasks. Also, our long-term analysis revealed that ALBERT intrinsically shows transient chaos, a typical nonlinear phenomenon showing chaotic dynamics only in its transient, and the pre-trained ALBERT model tends to produce the chaotic trajectory for a significantly longer time period compared to a randomly-initialized one. Our results imply that local chaoticity would contribute to improving NLP performance, uncovering a novel aspect in the role of chaotic dynamics in human language behaviors.
