Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1445
Revealing the Dark Secrets of BERT

Abstract: BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to their success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose a methodology and carry out a qualitative and quantitative analysis of the information encoded by individual BERT heads.

Cited by 428 publications (428 citation statements)
References 23 publications
“…Depending on the task and model architecture, attention may have more or less explanatory power for model predictions [35,51,57,71,79]. Visualization techniques have been used to convey the structure and properties of attention in Transformers [31,40,80,82]. Recent work has begun to apply attention to guide mapping of sequence models outside of the domain of natural language [70].…”
Section: Interpreting Models in NLP
confidence: 99%
“…This study falls into the second category and is motivated by the observation that most self-attention patterns learned by the Transformer architecture merely reflect positional encoding of contextual information (Raganato and Tiedemann, 2018; Kovaleva et al., 2019; Voita et al., 2019a). Hence, we argue that most attentive connections in the encoder do not need to be learned at all, but can be replaced by simple predefined patterns.…”
Section: Introduction
confidence: 99%
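The idea of replacing learned attention with simple predefined patterns can be sketched as a fixed attention matrix in which every position attends to a constant relative offset (for example, the previous token). This is a minimal NumPy illustration of the concept, not the cited work's implementation; the function name and offset convention are assumptions for the example.

```python
import numpy as np

def predefined_attention(seq_len: int, offset: int = -1) -> np.ndarray:
    """Build a fixed (non-learned) attention matrix.

    Each position i attends fully to position i + offset (e.g. the
    previous token when offset = -1); positions without a valid target
    fall back to attending to themselves. No parameters are learned.
    """
    attn = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        j = i + offset
        attn[i, j if 0 <= j < seq_len else i] = 1.0
    return attn

# Apply the fixed pattern to a toy sequence of value vectors.
values = np.arange(4 * 2, dtype=float).reshape(4, 2)  # (seq_len=4, dim=2)
attn = predefined_attention(4)
out = attn @ values  # each output row copies the previous token's values
```

Because the pattern is hard-coded, it removes the query-key dot product entirely; in practice one such pattern would be assigned per attention head rather than learned.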
“…To predict task labels, we simply add a linear transformation layer on top of individual BERT outputs and use a softmax function to normalize the label vectors. It has been shown that BERT is a powerful representation method, which contains hierarchical lexical, syntactic, and semantic knowledge [41]. Hence, we believe it is a strong baseline for comparison.…”
Section: Comparison Results
confidence: 98%
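The classification head described above (a linear layer over a pooled BERT output, followed by softmax) can be sketched as below. This is a minimal NumPy sketch under assumed shapes (BERT-base hidden size 768, an illustrative 3-way label set, a random stand-in for the [CLS] vector), not the cited paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits: np.ndarray) -> np.ndarray:
    """Normalize logits into a probability distribution over labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def classify(cls_embedding: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Linear transformation over a pooled BERT output, then softmax."""
    return softmax(cls_embedding @ W + b)

hidden, n_labels = 768, 3            # BERT-base hidden size; label count is illustrative
cls = rng.standard_normal(hidden)    # stand-in for the [CLS] vector a BERT encoder would produce
W = rng.standard_normal((hidden, n_labels)) * 0.02  # small init, as in BERT fine-tuning
b = np.zeros(n_labels)

probs = classify(cls, W, b)          # per-label probabilities summing to 1
```

In a real fine-tuning setup, `W` and `b` would be trained jointly with (or on top of) the pretrained encoder; here they are random to keep the sketch self-contained.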