2021
DOI: 10.48550/arxiv.2106.01335
Preprint
On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers

Cited by 5 publications (7 citation statements)
References 20 publications
“…In those attention focus drifting heads, the average attention weight from other tokens to trigger tokens in poisoned samples is very large, despite the attention sparsity property of normal transformer models (Ji et al., 2021). The Attn columns of Table 10 show that, in attention focus drifting heads, the average attention pointing to the trigger tokens is much higher when the true trigger is present in sentences in Trojaned models compared with clean models.…”
Section: E1 Attention Weights
confidence: 97%
“…The multi-head attention in BERT (Devlin et al., 2019; Vaswani et al., 2017) has been shown to make more efficient use of the model capacity. Previous work on analyzing multi-head attention evaluates the importance of attention heads by LRP and pruning (Voita et al., 2019), illustrates how the attention heads behave (Clark et al., 2019), interprets the information interactions inside the transformer (Hao et al., 2021), or quantifies the distribution and sparsity of the attention values in transformers (Ji et al., 2021). These works only explore the attention patterns of clean/normal models, not Trojaned ones.…”
Section: Related Work
confidence: 99%
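One common way to quantify the attention sparsity mentioned above is to count how many of the largest weights in a row are needed to cover a given fraction of the total attention mass. A hedged sketch of that idea (illustrative only; not necessarily the exact metric of Ji et al., 2021):

```python
import numpy as np

def tokens_to_cover(attn_row, mass=0.85):
    """Return how many of the largest attention weights are needed so that
    their sum reaches at least `mass` of the row's total attention."""
    sorted_w = np.sort(attn_row)[::-1]        # weights in descending order
    cumulative = np.cumsum(sorted_w)          # running coverage of the mass
    return int(np.searchsorted(cumulative, mass) + 1)

# A sparse row: the top two weights already cover 85% of the mass.
row = np.array([0.60, 0.30, 0.05, 0.03, 0.02])
print(tokens_to_cover(row, mass=0.85))  # 2
```

A small count relative to sequence length indicates a sparse attention distribution; averaging this count over rows, heads, and layers gives a per-model sparsity profile.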
“…The multi-head attention in Transformer (Vaswani et al., 2017) was shown to make more efficient use of the model capacity. Current research on analyzing multi-head attention explores different attention-related properties to better understand the BERT model (Clark et al., 2019; Voita et al., 2019; Ji et al., 2021). However, they only analyze attention head behavior on benign BERT models, not on trojan models.…”
Section: Introduction
confidence: 99%