How can we effectively inform content selection in Transformer-based abstractive summarization models? In this work, we present a simple-yet-effective attention head masking technique, which is applied to encoder-decoder attentions to pinpoint salient content at inference time. Using attention head masking, we are able to reveal the relation between encoder-decoder attentions and the content selection behavior of summarization models. We then demonstrate its effectiveness on three document summarization datasets in both in-domain and cross-domain settings. Importantly, our models outperform prior state-of-the-art models on the CNN/DailyMail and New York Times datasets. Moreover, our inference-time masking technique is also data-efficient, requiring less than 20% of the training samples to outperform BART fine-tuned on the full CNN/DailyMail dataset.
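For illustration, the sketch below shows one plausible form of inference-time head masking: selected encoder-decoder attention heads are restricted to attend only to source tokens flagged as salient. The function name `head_masked_cross_attention`, the tensor shapes, and the choice of which heads to mask are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (illustrative only; not the paper's code). Assumes a standard
# multi-head encoder-decoder attention layout and a precomputed binary
# saliency mask over source tokens.
import torch

def head_masked_cross_attention(query, key, value, token_mask, masked_heads):
    """Cross-attention where heads in `masked_heads` may only attend to source
    tokens marked salient (token_mask == 1); the remaining heads are untouched.

    query:        (batch, n_heads, tgt_len, d_head) decoder states
    key, value:   (batch, n_heads, src_len, d_head) encoder states
    token_mask:   (batch, src_len) binary saliency mask over source tokens
    masked_heads: list of head indices the mask is applied to
    """
    d_head = query.size(-1)
    scores = torch.matmul(query, key.transpose(-1, -2)) / d_head ** 0.5
    blocked = (token_mask == 0)[:, None, None, :]            # (batch, 1, 1, src_len)
    scores[:, masked_heads] = scores[:, masked_heads].masked_fill(blocked, float("-inf"))
    attn = torch.softmax(scores, dim=-1)                      # (batch, heads, tgt, src)
    return torch.matmul(attn, value)

# Illustrative usage: restrict the first four of eight heads to three "salient" tokens.
batch, n_heads, tgt_len, src_len, d_head = 1, 8, 4, 10, 64
q = torch.randn(batch, n_heads, tgt_len, d_head)
k = torch.randn(batch, n_heads, src_len, d_head)
v = torch.randn(batch, n_heads, src_len, d_head)
saliency = torch.zeros(batch, src_len)
saliency[:, [2, 5, 7]] = 1.0
out = head_masked_cross_attention(q, k, v, saliency, masked_heads=[0, 1, 2, 3])
```

Because the mask is applied only to attention scores at decoding time, no retraining of the underlying model is required under this reading.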