Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.236

A Simple and Effective Positional Encoding for Transformers

Abstract: Transformer models are permutation equivariant. To supply the order and type information of the input tokens, position and segment embeddings are usually added to the input. Recent works proposed variations of positional encodings, with relative position encodings achieving better performance. Our analysis shows that the gain actually comes from moving positional information from the input to the attention layer. Motivated by this, we introduce Decoupled posItional attEntion for Transformers (DIET), a simple yet ef…
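To make the abstract's distinction concrete, here is a minimal NumPy sketch (not the authors' reference implementation; the scalar-per-offset bias and all names are illustrative assumptions) contrasting absolute position embeddings added to the input with a decoupled positional bias added directly to the attention logits, in the spirit of DIET:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
tokens = rng.normal(size=(seq_len, d_model))   # token (content) embeddings
pos_emb = rng.normal(size=(seq_len, d_model))  # learned absolute position embeddings
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))

# (a) Classic absolute encoding: positions are added to the input,
#     so content and position mix inside the query/key projections.
x_abs = tokens + pos_emb
scores_abs = (x_abs @ Wq) @ (x_abs @ Wk).T / np.sqrt(d_model)

# (b) Decoupled positional attention: the input carries only content;
#     positional information enters as a separate additive bias on the
#     attention logits (here, one learned scalar per relative offset).
rel_bias = rng.normal(size=(2 * seq_len - 1,))
offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
bias = rel_bias[offsets + seq_len - 1]          # (seq_len, seq_len)
scores_dec = (tokens @ Wq) @ (tokens @ Wk).T / np.sqrt(d_model) + bias

print(softmax(scores_abs).shape, softmax(scores_dec).shape)  # (6, 6) (6, 6)

In the second variant, token content and the positional term never mix inside the query/key projections; position enters the attention scores only as its own additive component.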

Cited by 25 publications (8 citation statements)
References 12 publications
“…Because each word is matched to sine and cosine curves of different periods through the trigonometric transformation, each position obtains a unique positional encoding. In addition, recent research reports advanced positional encodings such as Decoupled posItional attEntion for Transformers (DIET) [54] and Position Encoding Generator (PEG) [55]…”
Section: Basic Architecture of Transformers (mentioning)
confidence: 99%
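The sinusoidal scheme described in the excerpt above is the standard encoding from the original Transformer paper; a short, self-contained NumPy version for reference (the helper name is ours):

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)   # geometrically spaced frequencies
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16); each row is a unique, deterministic position vector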
“…To address this, the first Transformer models added an "absolute" positional encoding to the model input which communicates the order of the input sequence [7]. Since then, alternative "relative" positional encoding methods have been proposed for sequential inputs which inject a bias term at different locations within the Transformer architecture [8], [9]. The aim of these strategies is the same: communicate the structure of the input data to the Transformer.…”
Section: Capturing Graph Structure (mentioning)
confidence: 99%
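As one concrete example of a relative scheme that injects a bias term inside the attention computation, here is a hedged sketch in the style of Shaw et al. (2018): learned embeddings indexed by the clipped query-key offset contribute a content-position term to the attention logits. Function and variable names are our own, not taken from the cited works:

import numpy as np

def relative_attention_logits(x, Wq, Wk, rel_emb, max_dist):
    # x: (seq_len, d_model); rel_emb: (2 * max_dist + 1, d_model)
    seq_len, d_model = x.shape
    q, k = x @ Wq, x @ Wk
    content = q @ k.T                                          # content-content term
    offsets = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    offsets = np.clip(offsets, -max_dist, max_dist) + max_dist  # clip to learned range
    rel_k = rel_emb[offsets]                                    # (seq_len, seq_len, d_model)
    position = np.einsum('qd,qkd->qk', q, rel_k)                # content-position bias
    return (content + position) / np.sqrt(d_model)

rng = np.random.default_rng(0)
L, D, M = 5, 8, 3
logits = relative_attention_logits(
    rng.normal(size=(L, D)), rng.normal(size=(D, D)),
    rng.normal(size=(D, D)), rng.normal(size=(2 * M + 1, D)), M)
print(logits.shape)  # (5, 5)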
“…However, Pu-Chin et al. found that absolute positional encodings underperform relative positional encodings and suffer from limitations in the rank of the resulting attention matrices [9]. As a result, we focus our work exclusively on relative positional encoding strategies.…”
Section: Capturing Graph Structure (mentioning)
confidence: 99%
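The rank limitation alluded to above is easiest to see by expanding the pre-softmax attention score when absolute position embeddings p_i are added to token embeddings x_i; the decomposition below is the standard one (our notation, not copied from [9]):

\[
A_{ij} = \frac{(x_i + p_i)^\top W_Q W_K^\top (x_j + p_j)}{\sqrt{d}}
       = \frac{x_i^\top W_Q W_K^\top x_j + x_i^\top W_Q W_K^\top p_j + p_i^\top W_Q W_K^\top x_j + p_i^\top W_Q W_K^\top p_j}{\sqrt{d}}
\]

Only the first term is purely content-based; the position-dependent terms are all routed through the same projection product W_Q W_K^\top, whose rank is bounded by the per-head dimension, which is where the constraint discussed in [9] arises.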
“…Therefore, we investigate whether the performance improvement of our method truly stems from the utterance dependencies or merely from segmenting the conversation. We compare ReDE with the segment encoding method introduced by Chen et al. [17]. To make a fair comparison, the number of parameters in the compared methods is kept the same.…”
Section: Ablation Study (mentioning)
confidence: 99%
“…For example, the question in Case #1 is "What is the other way ...". The baseline finds a wrong answer, "ls FILEPATH or ls FILEPATH", in the second utterance because the phrase "Many ways" strongly matches "other way" in the question, and the "...or..." pattern to some extent carries the meaning of "the other way". In Case #2 the question is "How does ... use the different icon"; the baseline finds a wrong answer with the close pattern "get ... to use a different icon", but fails to consider the true meaning of the selected … RoBERTa seg denotes the segment encoding method [17].…”
Section: Case Study (mentioning)
confidence: 99%