2020
DOI: 10.48550/arxiv.2005.09940
Preprint
Relative Positional Encoding for Speech Recognition and Direct Translation

Abstract: Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is relative distance between input states in the self-attention network. As a result, the network can better adapt …
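As a rough illustration of the mechanism the abstract describes (attention logits that depend on the signed distance between input frames), here is a minimal single-head sketch in NumPy. It follows a Transformer-XL-style formulation with fixed sinusoidal relative embeddings and learnable bias vectors u and v; the names, shapes, and single-head simplification are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def sinusoid_embedding(positions, d_model):
    """Fixed sinusoidal embedding evaluated at (possibly negative) relative positions."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = np.outer(positions, inv_freq)                     # (P, d_model/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def relative_self_attention(x, Wq, Wk, Wv, Wr, u, v):
    """x: (T, d) input states; returns (T, d) attended outputs."""
    T, d = x.shape
    q, k, val = x @ Wq, x @ Wk, x @ Wv
    # Embed every possible signed distance i - j, from T-1 down to -(T-1).
    rel_pos = np.arange(T - 1, -T, -1)                         # (2T-1,)
    r = sinusoid_embedding(rel_pos, d) @ Wr                    # (2T-1, d)
    content = (q + u) @ k.T                                    # content term, (T, T)
    position = (q + v) @ r.T                                   # position term, (T, 2T-1)
    # Pick, for each query i and key j, the column of `position` holding distance i - j.
    idx = (T - 1) - (np.arange(T)[:, None] - np.arange(T)[None, :])
    pos_term = np.take_along_axis(position, idx, axis=1)       # (T, T)
    logits = (content + pos_term) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ val

# Toy usage: 6 acoustic frames, model width 8 (d must be even here).
T, d = 6, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))
Wq, Wk, Wv, Wr = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
out = relative_self_attention(x, Wq, Wk, Wv, Wr, np.zeros(d), np.zeros(d))
print(out.shape)  # (6, 8)
```

Because the position term is indexed by distance rather than by absolute frame index, the same learned interaction applies no matter where in the utterance a pair of frames sits, which is what makes the scheme more natural for variable-length acoustic input.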

Cited by 6 publications (8 citation statements)
References 25 publications
“…Relative positional encoding [12,13] is an extension of an absolute positional encoding technique that allows self-attention to handle relative positional information. The absolute positional encoding is defined as follows:…”
Section: Positional Encoding (mentioning, confidence: 99%)
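The quoted definition is cut off by the snippet view. For reference, the absolute positional encoding it presumably goes on to state is the standard sinusoidal one from Vaswani et al. (2017):

```latex
PE_{(pos,\,2i)}   = \sin\!\bigl(pos / 10000^{2i/d_{\mathrm{model}}}\bigr), \qquad
PE_{(pos,\,2i+1)} = \cos\!\bigl(pos / 10000^{2i/d_{\mathrm{model}}}\bigr)
```

where pos is the absolute frame index and i indexes the embedding dimension; relative schemes evaluate the same kind of function at the signed distance between two positions instead of at pos.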
“…To solve this problem, several studies have been proposed. Masking [11] limits the range of self-attention by using a Gaussian window, whereas relative positional encoding [12,13] uses relative embedding in a self-attention architecture to eliminate the effect of the length mismatch. However, masking does not take into account the correlation between input features and relative distance.…”
Section: Introduction (mentioning, confidence: 99%)
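To make the contrast in this statement concrete, here is a minimal NumPy sketch of Gaussian-window masking: a distance-based penalty added to the attention logits before the softmax. The single shared width sigma is an assumption for illustration; the cited method [11] may parameterize the window differently (e.g. per head or per position).

```python
import numpy as np

def gaussian_masked_attention(q, k, v, sigma=8.0):
    """q, k, v: (T, d) arrays. Returns (T, d) outputs with locality-biased attention."""
    T, d = q.shape
    logits = (q @ k.T) / np.sqrt(d)                        # content scores, (T, T)
    dist = np.arange(T)[:, None] - np.arange(T)[None, :]   # signed distance i - j
    logits = logits - (dist ** 2) / (2.0 * sigma ** 2)     # Gaussian locality penalty
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The penalty depends only on the distance between frames and never on their content, which is exactly the limitation the citing authors point out relative positional encoding avoids.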
“…They all apply learned RPE in the attention domain. Using fixed embedding functions was also considered for RPE (Pham et al., 2020), and masking RPE is used in Kim et al. (2020) to promote local attention.…”
Section: Related Work (mentioning, confidence: 99%)
“…The Transformer model (Vaswani et al., 2017) is a new kind of neural network that quickly became state-of-the-art in many application domains, including the processing of natural language (He et al., 2020), images (Dosovitskiy et al., 2020), audio (Huang et al., 2018; Pham et al., 2020) or bioinformatics (AlQuraishi, 2019) […] regression (Nadaraya, 1964; Watson, 1964) and consists in a simple weighted sum:…”
Section: Introduction, 1. Linear Complexity Transformers (mentioning, confidence: 99%)
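The quoted passage breaks off at the formula. The weighted sum it refers to is presumably standard softmax attention, which has the same form as the Nadaraya-Watson kernel estimator it cites (the notation below is mine, shown for comparison):

```latex
\hat{f}(x) = \sum_{i} \frac{K(x, x_i)}{\sum_{j} K(x, x_j)}\, y_i
\qquad\text{vs.}\qquad
\operatorname{Attn}(q_i) = \sum_{j} \frac{\exp\!\bigl(q_i^{\top} k_j / \sqrt{d}\bigr)}{\sum_{j'} \exp\!\bigl(q_i^{\top} k_{j'} / \sqrt{d}\bigr)}\, v_j
```

In both cases the output is a convex combination of values, with weights given by a normalized similarity kernel between the query and each key.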
“…Recently, relative positional embedding [8,9] uses relative embedding to solve the insufficient localness modeling problem for the embedding layers of the Transformer-based ASR. However, the relative embedding itself does not limit the attention to the neighborhood of the frame.…”
Section: Introduction (mentioning, confidence: 99%)