2020
DOI: 10.48550/arxiv.2005.09940
Preprint
Relative Positional Encoding for Speech Recognition and Direct Translation

Abstract: Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is relative distance between input states in the self-attention network. As a result, the network can better adapt …
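As a rough illustration of the mechanism the abstract describes (attention logits that depend on the signed distance between input frames), here is a minimal single-head sketch in NumPy. It follows a Transformer-XL-style formulation with fixed sinusoidal relative embeddings and learnable bias vectors u and v; the names, shapes, and single-head simplification are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def sinusoid_embedding(positions, d_model):
    """Fixed sinusoidal embedding evaluated at (possibly negative) relative positions."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = np.outer(positions, inv_freq)                     # (P, d_model/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def relative_self_attention(x, Wq, Wk, Wv, Wr, u, v):
    """x: (T, d) input states; returns (T, d) attended outputs."""
    T, d = x.shape
    q, k, val = x @ Wq, x @ Wk, x @ Wv
    # Embed every possible signed distance i - j, from T-1 down to -(T-1).
    rel_pos = np.arange(T - 1, -T, -1)                         # (2T-1,)
    r = sinusoid_embedding(rel_pos, d) @ Wr                    # (2T-1, d)
    content = (q + u) @ k.T                                    # content term, (T, T)
    position = (q + v) @ r.T                                   # position term, (T, 2T-1)
    # Pick, for each query i and key j, the column of `position` holding distance i - j.
    idx = (T - 1) - (np.arange(T)[:, None] - np.arange(T)[None, :])
    pos_term = np.take_along_axis(position, idx, axis=1)       # (T, T)
    logits = (content + pos_term) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ val

# Toy usage: 6 acoustic frames, model width 8 (d must be even here).
T, d = 6, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))
Wq, Wk, Wv, Wr = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
out = relative_self_attention(x, Wq, Wk, Wv, Wr, np.zeros(d), np.zeros(d))
print(out.shape)  # (6, 8)
```

Because the position term is indexed by distance rather than by absolute frame index, the same learned interaction applies no matter where in the utterance a pair of frames sits, which is what makes the scheme more natural for variable-length acoustic input.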

Cited by 6 publications (8 citation statements)
References 25 publications
“…Relative positional encoding [12,13] is an extension of an absolute positional encoding technique that allows self-attention to handle relative positional information. The absolute positional encoding is defined as follows:…”
Section: Positional Encoding (mentioning, confidence: 99%)
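The quoted definition is cut off by the snippet view. For reference, the absolute positional encoding it presumably goes on to state is the standard sinusoidal one from Vaswani et al. (2017):

```latex
PE_{(pos,\,2i)}   = \sin\!\bigl(pos / 10000^{2i/d_{\mathrm{model}}}\bigr), \qquad
PE_{(pos,\,2i+1)} = \cos\!\bigl(pos / 10000^{2i/d_{\mathrm{model}}}\bigr)
```

where pos is the absolute frame index and i indexes the embedding dimension; relative schemes evaluate the same kind of function at the signed distance between two positions instead of at pos.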
“…To solve this problem, several studies have been proposed. Masking [11] limits the range of self-attention by using a Gaussian window, whereas relative positional encoding [12,13] uses relative embedding in a self-attention architecture to eliminate the effect of the length mismatch. However, masking does not take into account the correlation between input features and relative distance.…”
Section: Introduction (mentioning, confidence: 99%)
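To make the contrast in this statement concrete, here is a minimal NumPy sketch of Gaussian-window masking: a distance-based penalty added to the attention logits before the softmax. The single shared width sigma is an assumption for illustration; the cited method [11] may parameterize the window differently (e.g. per head or per position).

```python
import numpy as np

def gaussian_masked_attention(q, k, v, sigma=8.0):
    """q, k, v: (T, d) arrays. Returns (T, d) outputs with locality-biased attention."""
    T, d = q.shape
    logits = (q @ k.T) / np.sqrt(d)                        # content scores, (T, T)
    dist = np.arange(T)[:, None] - np.arange(T)[None, :]   # signed distance i - j
    logits = logits - (dist ** 2) / (2.0 * sigma ** 2)     # Gaussian locality penalty
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The penalty depends only on the distance between frames and never on their content, which is exactly the limitation the citing authors point out relative positional encoding avoids.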
“…They all apply learned RPE in the attention domain. Using fixed embedding functions was also considered for RPE (Pham et al., 2020), and masking RPE is used in Kim et al. (2020) to promote local attention.…”
Section: Related Work (mentioning, confidence: 99%)
“…The Transformer model (Vaswani et al., 2017) is a new kind of neural network that quickly became state-of-the-art in many application domains, including the processing of natural language (He et al., 2020), images (Dosovitskiy et al., 2020), audio (Huang et al., 2018; Pham et al., 2020) or bioinformatics (AlQuraishi, 2019) […] regression (Nadaraya, 1964; Watson, 1964) and consists in a simple weighted sum:…”
Section: Introduction, 1. Linear Complexity Transformers (mentioning, confidence: 99%)
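The quoted passage breaks off at the formula. The weighted sum it refers to is presumably standard softmax attention, which has the same form as the Nadaraya-Watson kernel estimator it cites (the notation below is mine, shown for comparison):

```latex
\hat{f}(x) = \sum_{i} \frac{K(x, x_i)}{\sum_{j} K(x, x_j)}\, y_i
\qquad\text{vs.}\qquad
\operatorname{Attn}(q_i) = \sum_{j} \frac{\exp\!\bigl(q_i^{\top} k_j / \sqrt{d}\bigr)}{\sum_{j'} \exp\!\bigl(q_i^{\top} k_{j'} / \sqrt{d}\bigr)}\, v_j
```

In both cases the output is a convex combination of values, with weights given by a normalized similarity kernel between the query and each key.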
“…Recently, relative positional embedding [8,9] uses relative embedding to solve the insufficient localness modeling problem for the embedding layers of the Transformer-based ASR. However, the relative embedding itself does not limit the attention to the neighborhood of the frame.…”
Section: Introduction (mentioning, confidence: 99%)