Recently, the Transformer has achieved success in the automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose a Transformer-based online CTC/attention E2E ASR architecture, which contains a chunk self-attention encoder (chunk-SAE) and a monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Secondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into the online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.
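The chunk-wise encoding with state reuse described in this abstract can be illustrated with a minimal, single-layer, single-head sketch. All names, dimensions, and weights below are hypothetical; the cached raw chunks stand in for the reused hidden states of the paper's multi-layer encoder, and the MTA-based decoder is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunk_self_attention_with_state_reuse(x, chunk_size, left_chunks, d):
    """Split the input into isolated chunks; each chunk attends over the
    cached states of the previous `left_chunks` chunks plus itself,
    so earlier chunks are reused rather than recomputed."""
    rng = np.random.default_rng(0)
    # Hypothetical single-head projections (random weights for illustration).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    cache, outputs = [], []
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]            # current chunk
        ctx = np.concatenate(cache + [chunk], axis=0)  # reused left context + chunk
        q, k, v = chunk @ Wq, ctx @ Wk, ctx @ Wv
        att = softmax(q @ k.T / np.sqrt(d))
        outputs.append(att @ v)
        cache.append(chunk)                            # store states for reuse
        cache = cache[-left_chunks:]                   # keep a fixed left history
    return np.concatenate(outputs, axis=0)

# Example: 12 frames of 8-dim features, 4-frame chunks, reuse 1 previous chunk.
feats = np.random.default_rng(1).standard_normal((12, 8))
out = chunk_self_attention_with_state_reuse(feats, chunk_size=4, left_chunks=1, d=8)
print(out.shape)  # (12, 8)
```

Because each chunk only looks at a bounded left history plus itself, the latency is bounded by the chunk size, which is what enables the 320 ms online configuration reported above.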
This paper presents a method to pre-train Transformer-based encoder-decoder automatic speech recognition (ASR) models using sufficient target-domain text. During pre-training, we train the Transformer decoder as a conditional language model with empty or artificial states, rather than the real encoder states. With this pre-training strategy, the decoder can learn how to generate grammatical text sequences before learning how to generate correct transcriptions. In contrast to other methods that utilize text-only data to improve ASR performance, our method does not change the network architecture of the ASR model or introduce extra components such as text-to-speech (TTS) or text-to-encoder (TTE). Experimental results on the LibriSpeech corpus show that the proposed method can reduce the word error rate by over 10% relative, using 960 hours of transcriptions.
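A minimal sketch of this decoder pre-training idea is given below, assuming a PyTorch nn.TransformerDecoder fed with all-zero "encoder" memory as the artificial states. The model sizes, optimizer, and random token batch are hypothetical stand-ins for the real ASR decoder and the target-domain text.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real model's dimensions are not given in the abstract.
vocab_size, d_model, max_len = 1000, 256, 32

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
embed = nn.Embedding(vocab_size, d_model)
proj = nn.Linear(d_model, vocab_size)

def lm_pretrain_step(token_ids, optimizer):
    """One pre-training step: the decoder is trained as a conditional language
    model whose encoder states (memory) are artificial, here all zeros."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    tgt = embed(inputs)
    # Artificial encoder states: a single all-zero memory frame per utterance.
    memory = torch.zeros(token_ids.size(0), 1, d_model)
    causal_mask = torch.triu(
        torch.full((inputs.size(1), inputs.size(1)), float("-inf")), diagonal=1
    )
    hidden = decoder(tgt, memory, tgt_mask=causal_mask)
    loss = nn.functional.cross_entropy(proj(hidden).transpose(1, 2), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random token sequences standing in for target-domain text.
params = list(decoder.parameters()) + list(embed.parameters()) + list(proj.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
batch = torch.randint(0, vocab_size, (8, max_len))
print(lm_pretrain_step(batch, opt))
```

After this text-only pre-training, the same decoder weights would be attached to the acoustic encoder and fine-tuned on paired speech-text data, since no part of the ASR architecture is changed by the pre-training stage.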