Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.333
Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

Abstract: In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position embedding methods under a convolutional framework. We conduct experiments by training BE…
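A minimal sketch of the equivalence the abstract describes, assuming PyTorch; the tensor names and the window half-width K are illustrative assumptions, not the authors' released code. Adding a Shaw-style relative-position term to the attention logits and then dropping the content (query-key) term leaves, for each query position, a normalized weight profile over relative offsets, which can be read as a dynamically generated lightweight convolution kernel.

```python
# Sketch only: Shaw-style relative-position self-attention vs. its
# convolutional reading. Shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 8, 16   # batch, heads, sequence length, head dim
K = 3                      # assumed max relative distance (window half-width)

q = torch.randn(B, H, T, D)          # query vectors
k = torch.randn(B, H, T, D)          # key vectors
rel_emb = torch.randn(2 * K + 1, D)  # one embedding per clamped offset (j - i)

# Relative offsets j - i, clamped to [-K, K] and shifted to index rel_emb.
offsets = torch.arange(T)[None, :] - torch.arange(T)[:, None]   # (T, T)
idx = offsets.clamp(-K, K) + K

content_logits = torch.einsum("bhqd,bhkd->bhqk", q, k)              # token-token
position_logits = torch.einsum("bhqd,qkd->bhqk", q, rel_emb[idx])   # token-position

# Relative-position self-attention mixes both terms.
attn = F.softmax((content_logits + position_logits) / D ** 0.5, dim=-1)

# Dropping the content term leaves, for each query, a normalized kernel over
# relative offsets -- i.e. a dynamically generated lightweight convolution.
conv_like = F.softmax(position_logits / D ** 0.5, dim=-1)
print(attn.shape, conv_like.shape)   # both (B, H, T, T)
```

The composite attention proposed in the paper unites such relative position embedding terms under a convolutional framework; the sketch only illustrates the convolutional reading of the relative-position term.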

Cited by 9 publications (7 citation statements: 0 supporting, 7 mentioning, 0 contrasting); References 16 publications.
“…Relative attention uses the relative position representations of Shaw et al. (2018)'s approach but without the query-key dot product of multi-head attention. Thus, relative attention differs from rPosNet in that content information is provided to the queries, and it is equivalent to dynamic convolutions with global context (Chang et al., 2021). In total, the x-axis of Figure 2 depicts position-position interactions for aPosNet and rPosNet, token-position interactions for relative attention, token-token interactions for the Transformer, and token-token + token-position interactions for Shaw et al. (2018).…”
Section: The Impact of Gating and Query-Key Information (mentioning)
confidence: 99%
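A rough sketch of the interaction types contrasted in this passage, for a single head and sequence. The tensor names, the position-query stand-in p, and the way aPosNet/rPosNet are represented are illustrative assumptions, not the cited papers' exact parameterizations.

```python
# Sketch only: the logit terms named on the Figure 2 x-axis of the citing paper.
import torch

T, D = 6, 8
q = torch.randn(T, D)        # content queries (single head, single sequence)
k = torch.randn(T, D)        # content keys
p = torch.randn(T, D)        # stand-in position queries (schematic only)
r = torch.randn(T, T, D)     # r[i, j]: embedding of the relative offset j - i

token_token = q @ k.T                              # vanilla Transformer
token_pos   = torch.einsum("id,ijd->ij", q, r)     # "relative attention"
pos_pos     = torch.einsum("id,ijd->ij", p, r)     # aPosNet / rPosNet (schematic)
shaw        = token_token + token_pos              # Shaw et al. (2018)

# Softmaxing the token-position term alone gives per-query weights over
# relative offsets, the form the quoted passage equates with dynamic
# convolutions with global context (Chang et al., 2021).
weights = torch.softmax(token_pos, dim=-1)
print(weights.shape)   # (T, T)
```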
“…In the context of Transformers, countless works proposed ways to include some form of position-based attention bias (Shaw et al., 2018; Yang et al., 2018; Dai et al., 2019; Wang et al., 2020; Ke et al., 2021; Su et al., 2021; Luo et al., 2021; Qu et al., 2021; Chang et al., 2021; Wu et al., 2021; Wennberg & Henter, 2021; Likhomanenko et al., 2021; Dufter et al., 2022; Luo et al., 2022; Sun et al., 2022), inter alia. Dynamic convolution (Wu et al., 2019) and other similar models can also be related to this family; see Table 9.…”
Section: F. Connections to Other Models (mentioning)
confidence: 99%
“…Recently, Pre-trained Language Models (PLMs) have significantly improved various downstream NLP tasks (He et al. 2020; Xu et al. 2021; Chang et al. 2021). In PLMs, the two-stage strategy (i.e., pre-training and fine-tuning) (Devlin et al. 2019) inherits the knowledge learned during pre-training and applies it to downstream tasks.…”
Section: Introduction (mentioning)
confidence: 99%