2022
DOI: 10.48550/arxiv.2211.03495
Preprint
How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

Abstract: The attention mechanism is considered the backbone of the widely used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones: the average attention weights over multiple inputs. We use PAPA to analyze several established pret…
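As a rough illustration of the probing idea in the abstract, the following is a minimal sketch (not the authors' released PAPA code; tensor shapes and function names are assumptions) of replacing a head's input-dependent attention matrix with a constant one averaged over multiple inputs:

```python
# Minimal sketch (not the authors' released PAPA code): swap a head's
# input-dependent attention matrix for a constant one, estimated as the
# average of its attention weights over a sample of inputs.
import torch

def average_attention(attn_maps):
    # attn_maps: list of [seq_len, seq_len] softmaxed attention matrices
    # collected from the same head over multiple inputs (assumed to share
    # a common sequence length here for simplicity).
    return torch.stack(attn_maps).mean(dim=0)

def constant_attention_forward(values, const_attn):
    # Contextualize `values` ([seq_len, d_head]) with the fixed matrix
    # instead of weights computed from the current input's queries/keys.
    return const_attn @ values
```

Comparing downstream performance with and without input-dependent attention then indicates how much the pretrained model actually relies on it, which is the measurement the probe is built for.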

Cited by 2 publications (7 citation statements); references 18 publications.

Citation statements (ordered by relevance):
“…We then observe that the knowledge contained within MHSA undergoes more frequent changes compared to that within FFN. Combining previous findings of MHSA (Geva et al. 2023; Wang et al. 2022; Hassid et al. 2022) with our observation, we believe that MHSA works as a knowledge extractor and stores certain general knowledge extraction patterns. This suggests the potential for supplementary optimization of TC hidden states of the MHSA to expand the function space, without necessitating updates to its weights.…”
Section: Introduction
Supporting; confidence: 89%
“…We attribute this observation to the fact that the MHSA continuously extracts various types of knowledge, while the FFN primarily extracts its own knowledge (Geva et al. 2021; Meng et al. 2022a). Furthermore, considering previous findings regarding the extraction of attributes from the MHSA with observed redundancies (Geva et al. 2023; Wang et al. 2022; Hassid et al. 2022), we believe that the MHSA works as a knowledge extractor and stores certain general knowledge extraction patterns. Thus we suggest that when introducing new knowledge, there is no need to update the MHSA weights.…”
Section: Methodology (Preliminaries)
Mentioning; confidence: 55%
“…Recently, the Transformer has attracted significant interest in the computer vision community, thanks to its powerful representation capabilities. However, several works have found that the excellence of the Transformer comes, to some extent, from its macro-level framework and advanced components rather than from its self-attention (SA) mechanism [38, 36, 12]. Surprisingly, comparable results on multiple mainstream computer vision tasks can still be obtained by replacing SA with spatial pooling [38], spatial shifting [36], spatial MLPs [33, 32, 34, 13, 23], the Fourier transform [31, 17], or a constant matrix [12], all of which have spatial information encoding capability similar to SA. Following this motivation, our main purpose is to design a new spatial information encoder that encodes spatial features efficiently by introducing large-kernel convolution operations and convolutional modulation technology to realize long-range correlations and self-adaptive behavior like SA, within the powerful Transformer macro framework.…”
Section: Introduction
Mentioning; confidence: 99%
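For concreteness, here is a minimal sketch of the kind of replacement this citing work describes: a Transformer-style block whose token mixer is spatial average pooling instead of self-attention, in the spirit of the pooling and constant-matrix substitutes cited above. Module and parameter names are illustrative assumptions, not taken from any of the cited papers.

```python
# Illustrative sketch (assumed module/parameter names): a Transformer-style
# block that mixes spatial information with average pooling rather than
# self-attention, while keeping the usual norm + MLP macro structure.
import torch
import torch.nn as nn

class PoolMixerBlock(nn.Module):
    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Average pooling mixes spatial information with no attention at all.
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):  # x: [batch, height, width, channels]
        # Token mixing: pool over the spatial dims (channels-last -> NCHW and back).
        y = self.norm1(x).permute(0, 3, 1, 2)
        y = self.pool(y).permute(0, 2, 3, 1)
        x = x + y
        # Channel mixing, as in a standard Transformer MLP sub-block.
        return x + self.mlp(self.norm2(x))
```

The point of such a block is that the residual-plus-MLP macro framework is kept intact while the input-dependent attention matrix is removed, which is what lets these works isolate how much of the Transformer's performance SA itself is responsible for.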