2018
DOI: 10.48550/arxiv.1803.02155
Preprint

Self-Attention with Relative Position Representations

Cited by 284 publications (415 citation statements)
References 0 publications

“…Apart from the network architecture modifications, several advanced tricks for vision transformers are also adopted. Relative position encoding [25] is added to the self-attention module to better represent relative positions between tokens. Linear spatial reduction attention (LSRA) [33] is utilized in the first two stages to reduce the computation cost of self-attention for long sequences.…”
Section: Methods (citation type: mentioning)
confidence: 99%
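
A minimal sketch of the technique this statement refers to: adding learned relative position representations to self-attention, in the spirit of the cited preprint (Shaw et al. [25]). The function name, the single-head unbatched setup, and the clipping distance max_dist are illustrative assumptions, not the cited implementation.

```python
import torch
import torch.nn.functional as F

def relative_self_attention(x, w_q, w_k, w_v, rel_k, rel_v, max_dist):
    """x: (n, d_model); w_*: (d_model, d_head);
    rel_k, rel_v: (2 * max_dist + 1, d_head) learned relative embeddings."""
    n, d_head = x.size(0), w_q.size(1)
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # (n, d_head)

    # Relative distance j - i, clipped to [-max_dist, max_dist] and shifted
    # to index the embedding tables.
    pos = torch.arange(n)
    rel = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist
    a_k, a_v = rel_k[rel], rel_v[rel]                          # (n, n, d_head)

    # Compatibility: content term q_i . k_j plus relative-key term q_i . a_ij^K.
    logits = (q @ k.T + torch.einsum("id,ijd->ij", q, a_k)) / d_head ** 0.5
    attn = F.softmax(logits, dim=-1)                           # (n, n)

    # Output: attention-weighted values plus the relative-value term a_ij^V.
    return attn @ v + torch.einsum("ij,ijd->id", attn, a_v)    # (n, d_head)

# Shape check with random tensors.
d_model, d_head, n, max_dist = 16, 8, 5, 4
x = torch.randn(n, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
rel_k = torch.randn(2 * max_dist + 1, d_head)
rel_v = torch.randn(2 * max_dist + 1, d_head)
print(relative_self_attention(x, w_q, w_k, w_v, rel_k, rel_v, max_dist).shape)
# torch.Size([5, 8])
```
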
“…Inspired by the recent works [34,36], we introduce two main architecture modifications: 1) a pyramid architecture with gradually decreased resolution to extract multi-scale representations, and 2) a convolutional stem for improving the patchify stem and stabilizing training. We also include several other tricks [25,33] to further improve efficiency. The new transformer is named PyramidTNT.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Because the added position embedding depends on the absolute positions of tokens in a sequence, it is called absolute position encoding. We'll use relative position encoding [12], which can directly encode the distance between tokens. To increase the accuracy of the short-term load forecast, we modified the similarity functions (Equation (1) and Equation (5)) as in…”
Section: Relative Position Encoding for Transformer (citation type: mentioning)
confidence: 99%
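
The quotation above is cut off before the citing work's modified Equations (1) and (5), which are not reproduced here. For reference, the general relative-position form of the attention "similarity" and output defined in the cited preprint [12] is:

```latex
% Relative-position self-attention as defined in the cited preprint (Shaw et al.):
% e_ij is the modified compatibility ("similarity") score, alpha_ij the
% attention weight, and z_i the output representation.
e_{ij}      = \frac{x_i W^Q \left( x_j W^K + a^K_{ij} \right)^{\top}}{\sqrt{d_z}}
\qquad
\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}}
\qquad
z_i         = \sum_{j=1}^{n} \alpha_{ij} \left( x_j W^V + a^V_{ij} \right)
```

where $a^K_{ij}$ and $a^V_{ij}$ are learned embeddings indexed by the relative distance $j - i$, clipped to a maximum range $k$.
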
“…This phenomenon could result from the inductive bias of CNNs, namely spatial similarity. In the future, we could add other popular modules to our model, such as absolute positional encoding [7], relative positional encoding [62], and conditional positional encoding [63], which could further improve the performance. We could also pre-train our model for few-shot or zero-shot learning, where only a few supervised labels are required in training.…”
Section: Time Series (citation type: mentioning)
confidence: 99%