Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1445
Revealing the Dark Secrets of BERT

Abstract: BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to their success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose a methodology and carry out a qualitative and quantitative analysis of the information encoded by individual BERT heads.

Cited by 428 publications (428 citation statements)
References 23 publications
“…Depending on the task and model architecture, attention may have more or less explanatory power for model predictions [35,51,57,71,79]. Visualization techniques have been used to convey the structure and properties of attention in Transformers [31,40,80,82]. Recent work has begun to apply attention to guide mapping of sequence models outside of the domain of natural language [70].…”
Section: Interpreting Models in NLP
confidence: 99%
“…This study falls into the second category and is motivated by the observation that most self-attention patterns learned by the Transformer architecture merely reflect positional encoding of contextual information (Raganato and Tiedemann, 2018; Kovaleva et al., 2019; Voita et al., 2019a). Hence, we argue that most attentive connections in the encoder do not need to be learned at all, but can be replaced by simple predefined patterns.…”
Section: Introduction
confidence: 99%
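The idea of replacing learned attention with simple predefined patterns can be sketched as a fixed attention matrix in which every position attends to a constant relative offset (for example, the previous token). This is a minimal NumPy illustration of the concept, not the cited work's implementation; the function name and offset convention are assumptions for the example.

```python
import numpy as np

def predefined_attention(seq_len: int, offset: int = -1) -> np.ndarray:
    """Build a fixed (non-learned) attention matrix.

    Each position i attends fully to position i + offset (e.g. the
    previous token when offset = -1); positions without a valid target
    fall back to attending to themselves. No parameters are learned.
    """
    attn = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        j = i + offset
        attn[i, j if 0 <= j < seq_len else i] = 1.0
    return attn

# Apply the fixed pattern to a toy sequence of value vectors.
values = np.arange(4 * 2, dtype=float).reshape(4, 2)  # (seq_len=4, dim=2)
attn = predefined_attention(4)
out = attn @ values  # each output row copies the previous token's values
```

Because the pattern is hard-coded, it removes the query-key dot product entirely; in practice one such pattern would be assigned per attention head rather than learned.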
“…To predict task labels, we simply add a linear transformation layer on top of individual BERT outputs and use a softmax function to normalize the label vectors. It has been shown that BERT is a powerful representation method, which contains hierarchical lexical, syntactic, and semantic knowledge [41]. Hence, we believe it is a strong baseline for comparison.…”
Section: Comparison Results
confidence: 98%
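The classification head described above (a linear layer over a pooled BERT output, followed by softmax) can be sketched as below. This is a minimal NumPy sketch under assumed shapes (BERT-base hidden size 768, an illustrative 3-way label set, a random stand-in for the [CLS] vector), not the cited paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits: np.ndarray) -> np.ndarray:
    """Normalize logits into a probability distribution over labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def classify(cls_embedding: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Linear transformation over a pooled BERT output, then softmax."""
    return softmax(cls_embedding @ W + b)

hidden, n_labels = 768, 3            # BERT-base hidden size; label count is illustrative
cls = rng.standard_normal(hidden)    # stand-in for the [CLS] vector a BERT encoder would produce
W = rng.standard_normal((hidden, n_labels)) * 0.02  # small init, as in BERT fine-tuning
b = np.zeros(n_labels)

probs = classify(cls, W, b)          # per-label probabilities summing to 1
```

In a real fine-tuning setup, `W` and `b` would be trained jointly with (or on top of) the pretrained encoder; here they are random to keep the sketch self-contained.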