ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053165
Transformer-Based Online CTC/Attention End-To-End Speech Recognition Architecture

Abstract: Recently, Transformer has gained success in automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the Transformer-based online CTC/attention E2E ASR architecture, which contains the chunk self-attention encoder (chunk-SAE) and the monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computatio…
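The chunk-SAE described above makes the encoder streamable by restricting self-attention to fixed-size chunks of the input. As a rough illustration of the idea (not the paper's implementation), the following Python sketch splits an utterance's feature frames into central chunks padded with left and right context; the function name, the NumPy representation and the chunk sizes are all illustrative assumptions.

```python
import numpy as np

def split_into_chunks(feats, n_center=16, n_left=16, n_right=8):
    """Split a (T, D) feature matrix into chunks for a chunk-wise encoder.

    Each chunk carries n_left past frames and n_right future frames of
    context around its n_center central frames (illustrative sizes).
    Returns a list of (chunk_feats, center_slice) pairs, where center_slice
    marks which rows of chunk_feats are the chunk's own output positions.
    """
    T = feats.shape[0]
    chunks = []
    for start in range(0, T, n_center):
        lo = max(0, start - n_left)              # left context (history)
        hi = min(T, start + n_center + n_right)  # right context (look-ahead)
        center = slice(start - lo, min(start + n_center, T) - lo)
        chunks.append((feats[lo:hi], center))
    return chunks

# Example: a 100-frame utterance with 80-dim filterbank features
chunks = split_into_chunks(np.random.randn(100, 80))
print(len(chunks))  # 7 chunks of up to 16 central frames each
```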

Cited by 93 publications (79 citation statements). References 17 publications.

Citation statements (ordered by relevance):
“…A similar Transformer architecture is adopted for all the three tasks. The online encoder is similar to the one presented in [22], which is a 6-layer chunk-SAE. The sizes of central, left and right […] We conduct CTC/attention joint training with the CTC weight of 0.3 for all tasks.…”
Section: Methods (citation type: mentioning)
confidence: 99%
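The CTC weight of 0.3 quoted above refers to the standard hybrid CTC/attention multi-task objective, in which the training loss interpolates a CTC term and an attention-decoder cross-entropy term. A minimal PyTorch sketch of that interpolation is given below; the tensor shapes and argument names are assumptions for illustration, not the cited recipe.

```python
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lengths,
                             target_lengths, decoder_logits, decoder_targets,
                             ctc_weight=0.3, pad_id=-100):
    """Interpolate CTC and attention losses: L = w * L_ctc + (1 - w) * L_att.

    ctc_log_probs:  (T, N, C) log-probabilities from the encoder's CTC head
    decoder_logits: (N, U, C) logits from the attention decoder
    """
    loss_ctc = F.ctc_loss(ctc_log_probs, ctc_targets,
                          input_lengths, target_lengths)
    loss_att = F.cross_entropy(decoder_logits.transpose(1, 2),
                               decoder_targets, ignore_index=pad_id)
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```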
“…(1), the attention is computed upon the full sequence of encoder and/or decoder states as required by the softmax function, which poses a big challenge for online recognition. To stream the Transformer ASR system, chunk-hopping based strategies [8,20,21,22] have been applied on the encoder side, where the input utterance is spliced into overlapping chunks and the chunks are chronologically fed to the SAE. Thus, the latency of the online encoder is subject to the chunk size.…”
Section: Transformer-Based Online ASR System (citation type: mentioning)
confidence: 99%
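To make the softmax point concrete: chunk-based strategies keep self-attention local, so each frame's softmax is computed over a bounded window rather than the whole utterance, and latency is bounded by the chunk size. The mask construction below is a simplified sketch of such chunk-restricted attention under assumed parameters; it is not the exact splicing scheme of any of the cited systems.

```python
import torch

def chunk_attention_mask(seq_len, chunk_size, num_left_chunks=1):
    """Boolean (seq_len, seq_len) mask: True where attention is allowed.

    Each frame may attend to frames in its own chunk and in up to
    num_left_chunks previous chunks, so the softmax is computed over a
    bounded window instead of the whole utterance.
    """
    idx = torch.arange(seq_len)
    chunk_id = idx // chunk_size        # chunk index of every frame
    q_chunk = chunk_id.unsqueeze(1)     # query frame's chunk
    k_chunk = chunk_id.unsqueeze(0)     # key frame's chunk
    return (k_chunk <= q_chunk) & (k_chunk >= q_chunk - num_left_chunks)

mask = chunk_attention_mask(seq_len=8, chunk_size=4, num_left_chunks=1)
print(mask.int())  # disallowed positions would be set to -inf before softmax
```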
“…In terms of output labels, WSJ and AiShell-1 have 52 and 4231 classes, respectively. The online Transformer model adopts a chunkwise self-attention encoder (chunk-SAE) as presented in [25]. Non-overlapping chunks with length Nc are spliced from the original utterance so that they could be sequentially fed into the model.…”
Section: Methods (citation type: mentioning)
confidence: 99%
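The sequential, chunk-by-chunk feeding described in this excerpt can be pictured as a streaming loop in which the encoder consumes one Nc-frame chunk at a time and carries its internal state forward. The loop below is a schematic sketch only; `encode_chunk` and `state` are hypothetical stand-ins, not the API of the cited model.

```python
def stream_encode(feats, encoder, n_c=64):
    """Feed non-overlapping chunks of length n_c to an incremental encoder.

    `encoder.encode_chunk(chunk, state)` is a hypothetical interface that
    returns the chunk's encoder outputs plus updated state (e.g. cached
    key/value history), so earlier chunks never have to be recomputed.
    """
    state = None
    outputs = []
    for start in range(0, len(feats), n_c):
        chunk = feats[start:start + n_c]       # next Nc frames
        out, state = encoder.encode_chunk(chunk, state)
        outputs.append(out)                    # available immediately
    return outputs
```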
“…Transformer-based E2E model is an encoder-decoder framework which has shown good performance for many ASR tasks [13,14,5]. Different from other RNN-based encoder-decoder model, transformer-based model uses multi-head attention (MHA) mechanism [15] to learn relationships between distant concepts, rather than relying on recurrent connections and memory cells.…”
Section: Transformer-Based E2E ASR (citation type: mentioning)
confidence: 99%
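For reference, multi-head attention projects queries, keys and values into several subspaces and attends in each in parallel. Rather than re-deriving the math, the short sketch below exercises PyTorch's built-in torch.nn.MultiheadAttention in a self-attention configuration; the dimensions are arbitrary example values.

```python
import torch
import torch.nn as nn

# Example dimensions (illustrative only): model width 256, 4 heads
mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

x = torch.randn(2, 50, 256)           # (batch, frames, features)
out, attn_weights = mha(x, x, x)      # self-attention: Q = K = V = x
print(out.shape, attn_weights.shape)  # (2, 50, 256) and (2, 50, 50)
```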