2021
DOI: 10.48550/arxiv.2101.09115
Preprint

The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT

Abstract: Multi-headed attention heads are a mainstay in transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence) and delimiter (the special [CLS], [SEP] tokens). There are two main challenges with existing methods for classification: (a) there are no standard scores acros…
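As a rough illustration of the head-role idea described in the abstract, the sketch below scores each BERT attention head on two of the listed roles, local and delimiter, directly from its attention maps. The score definitions, the ±1-position window, and the choice of bert-base-uncased are illustrative assumptions, not the paper's actual scoring procedure.

```python
# Minimal sketch (not the paper's exact scoring): estimate two of the head
# roles mentioned in the abstract -- "local" (attention to nearby tokens) and
# "delimiter" (attention to [CLS]/[SEP]) -- from BERT's attention maps.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 tensors, each (batch, heads, seq, seq)
attentions = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)
seq_len = attentions.shape[-1]

token_ids = inputs["input_ids"][0]
is_delim = (token_ids == tokenizer.cls_token_id) | (token_ids == tokenizer.sep_token_id)

# Local score: average attention mass placed on tokens within +/-1 position.
positions = torch.arange(seq_len)
near_mask = (positions[None, :] - positions[:, None]).abs() <= 1   # (seq, seq)
local_score = (attentions * near_mask).sum(-1).mean(-1)            # (layers, heads)

# Delimiter score: average attention mass placed on [CLS]/[SEP] tokens.
delim_score = (attentions * is_delim[None, None, None, :]).sum(-1).mean(-1)

for layer in range(attentions.shape[0]):
    for head in range(attentions.shape[1]):
        print(f"layer {layer:2d} head {head:2d} "
              f"local={local_score[layer, head]:.2f} "
              f"delim={delim_score[layer, head]:.2f}")
```

Under this heuristic, a head whose delimiter score dominates would be labelled a delimiter head (attending mostly to [CLS]/[SEP]), while a high local score suggests a positional/local head.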

Cited by 1 publication (3 citation statements)
References 25 publications
“…Which heads should we flip? We take motivation from studies that show that middle layers of BERT capture syntactic relations (Hewitt and Manning, 2019; Goldberg, 2019) and are multi-skilled (Pande et al., 2021), making them crucial for prediction. In contrast, the initial layers are responsible for phrase-level understanding while the last few layers are highly task-specific (Jawahar et al., 2019).…”
Section: Features From Flipping Heads in IAS f_flip
Mentioning, confidence: 99%
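To make the head-flipping discussion above concrete, here is a minimal sketch of disabling all heads in a few middle layers via HuggingFace's head_mask argument. Zeroing heads out is only a stand-in for whatever flipping operation the citing work actually applies, and the layer range 4–7 is an arbitrary, illustrative choice.

```python
# Hedged sketch: "flip" (here: zero out) attention heads in BERT's middle
# layers using the head_mask argument of BertModel.forward. The cited work's
# actual flipping operation may differ; the masked layer range is an assumption.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

num_layers = model.config.num_hidden_layers    # 12 for bert-base
num_heads = model.config.num_attention_heads   # 12 for bert-base

# head_mask: 1.0 keeps a head, 0.0 disables it; shape (layers, heads).
head_mask = torch.ones(num_layers, num_heads)
head_mask[4:8, :] = 0.0  # disable all heads in the middle layers (illustrative)

inputs = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    masked_out = model(**inputs, head_mask=head_mask)
    full_out = model(**inputs)

# Compare [CLS] representations with and without the middle-layer heads.
diff = (masked_out.last_hidden_state[:, 0] - full_out.last_hidden_state[:, 0]).norm()
print(f"L2 change in [CLS] embedding after masking middle-layer heads: {diff:.3f}")
```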
“…When comparing across datasets, we observe that AdvNet performs better on simpler sentence labelling datasets like SST-2 and AG News than on more complex tasks like RTE and MRPC, which require comparison between sentences. Existing work (Pande et al., 2021) shows that for simpler tasks the BERT heads perform discrete, non-overlapping roles, while for complex tasks there is greater overlap in head roles and a few heads perform more than one role. We hypothesize that this implies that the attention masks for different inputs, even those belonging to the same type (authentic or adversarial), can vary widely.…”
Section: Performance on Adversarial Detection
Mentioning, confidence: 99%