2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003749

Transformer ASR with Contextual Block Processing

Abstract: The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks (RNNs) in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute self-attention. In this paper, we propose a new block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism. An additional context embedding vector handed over from the previo…
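The drawback mentioned above is that standard self-attention is computed over the entire utterance at once. As a minimal illustration (a NumPy sketch under assumed shapes, not the paper's implementation), the (T, T) attention matrix below cannot be formed until all T input frames are available:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (T, d) full utterance; w_q, w_k, w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])          # (T, T): every frame attends to every frame
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all T frames
    return weights @ v                                # each output frame depends on the whole input
```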

Cited by 52 publications (42 citation statements)
References 35 publications

“…The authors of [23] also reported strong results on the LibriSpeech benchmark with transformers within a hybrid system. Several works [24][25][26], as well as transformer transducers [9,27], have demonstrated the effectiveness of transformers for online speech recognition. The authors of [28] also employed an unsupervised pre-training method to enhance the transformer for ASR.…”
Section: Existing Work on Transformers for ASR
confidence: 99%
“…However, the global channel, speaker, and linguistic context are also important for local phoneme classification. Therefore, a context inheritance mechanism for block processing was proposed in [9,10] by introducing an additional context embedding vector in the encoder. The encoder thus sequentially computes the encoded features h_{1:T_b} from the input x_{1:T_b} of the currently given block b.…”
Section: Streaming Encoder-Decoder ASR
confidence: 99%
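A minimal sketch of this context-inheritance idea (illustrative names and shapes, not the authors' code): each block of frames is encoded together with a context embedding carried over from the previous block, and the block's updated context embedding is handed to the next block. Here `layer` stands in for a generic self-attention layer mapping (L+1, d) to (L+1, d):

```python
import numpy as np

def encode_with_context(frames, ctx, layer):
    """frames: (L, d) one block of input frames; ctx: (d,) context embedding."""
    augmented = np.vstack([frames, ctx[None, :]])   # append context as an extra "token"
    encoded = layer(augmented)                      # self-attention within the block plus context
    return encoded[:-1], encoded[-1]                # block features, updated context embedding

def blockwise_encode(x, block_len, layer):
    """x: (T, d) input features, processed block by block."""
    ctx = np.zeros(x.shape[1])                      # initial context embedding
    outputs = []
    for start in range(0, len(x), block_len):
        feats, ctx = encode_with_context(x[start:start + block_len], ctx, layer)
        outputs.append(feats)
    return np.concatenate(outputs, axis=0)          # (T, d) encoded features

# Trivial data-flow check with an identity "layer":
# encoded = blockwise_encode(np.random.randn(100, 8), block_len=16, layer=lambda z: z)
```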
“…For interactive use cases in particular, streaming-style inference is essential; thus, several approaches have been proposed for both encoder-decoder (Enc-Dec) [1,2,3] and transducer models [4,5]. Blockwise processing can easily be introduced into the encoders of both models [6,7,8,9,10]. Although transducers are efficient for streaming ASR owing to frame-synchronous decoding, they are less accurate than Enc-Dec [11], and Enc-Dec can additionally be used to achieve higher performance [12].…”
Section: Introduction
confidence: 99%
“…(1), the attention is computed over the full sequence of encoder and/or decoder states, as required by the softmax function, which poses a major challenge for online recognition. To stream the Transformer ASR system, chunk-hopping-based strategies [8,20,21,22] have been applied on the encoder side, where the input utterance is spliced into overlapping chunks and the chunks are fed chronologically to the self-attention encoder (SAE). Thus, the latency of the online encoder is governed by the chunk size.…”
Section: Transformer-Based Online ASR System
confidence: 99%
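A hedged sketch of the chunk-hopping splicing step described in the excerpt (chunk and hop sizes are illustrative assumptions, not values from the cited works): the utterance is cut into overlapping chunks that are fed to the encoder in chronological order, so latency is bounded by the chunk length rather than the utterance length.

```python
import numpy as np

def chunk_hop(x, chunk_len=64, hop_len=32):
    """Splice an utterance x of shape (T, d) into overlapping (chunk_len, d) chunks."""
    chunks = []
    for start in range(0, max(len(x) - chunk_len, 0) + 1, hop_len):
        chunks.append(x[start:start + chunk_len])
    return chunks

# The chunks would then be fed chronologically to the encoder, e.g.:
# for chunk in chunk_hop(features): h = encoder(chunk)
```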