2020
DOI: 10.48550/arxiv.2001.08290
Preprint

Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture

Cited by 3 publications (4 citation statements)
References 18 publications
“…To apply MPC in streaming models, the Transformer encoder needs to be restricted to using only information that has already appeared. Though some previous work [26,27] employed chunkwise splitting for streaming models, in this paper we simply change the self-attention mask of the Transformer encoder to make the whole model streamable. Specifically, we use a triangular matrix for the self-attention mask M in the encoder, where the upper-triangular part is set to −∞ and the other elements to 0.…”
Section: MPC For Streaming Models
confidence: 99%
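The triangular mask described in the excerpt above can be constructed directly. Below is a minimal PyTorch sketch, not code from the cited work; the function name and framework choice are illustrative. The mask has −∞ above the diagonal and 0 elsewhere, and is added to the attention scores before the softmax so each frame can attend only to itself and to earlier frames.

```python
import torch

def causal_attention_mask(seq_len: int) -> torch.Tensor:
    """Triangular self-attention mask as described in the excerpt:
    upper-triangular entries (future frames) are -inf, all others are 0.
    Adding it to the attention scores before the softmax restricts each
    frame to itself and to past frames."""
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    return torch.zeros(seq_len, seq_len).masked_fill(future, float("-inf"))

# Illustrative use inside scaled dot-product attention (q, k: [T, d_k]):
# scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
# weights = torch.softmax(scores + causal_attention_mask(q.size(0)), dim=-1)
```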
“…In (18), u_{i,k} denotes the pre-softmax activations. In (19), w denotes the chunk width and α_{i,k} denotes the attention weight within the chunk.…”
Section: B. Monotonic Chunk-wise Attention (MoChA)
confidence: 99%
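For context, MoChA restricts the soft attention to a chunk of width w ending at the frame selected by the monotonic attention mechanism. The sketch below shows only the simplified inference-time (hard-boundary) step under that assumption; the variable names and shapes are illustrative, and the differentiable training-time formulation referenced by equations (18)–(19) in the citing paper is more involved.

```python
import torch

def chunkwise_attention(u: torch.Tensor, boundary: int, w: int) -> torch.Tensor:
    """Softmax over a chunk of width w ending at the selected boundary.

    u        -- pre-softmax activations over encoder frames, shape [T]
                (analogous to u_{i,k} in the excerpt)
    boundary -- encoder frame chosen by the monotonic attention mechanism
    w        -- chunk width
    Returns attention weights that are zero outside the chunk.
    """
    start = max(0, boundary - w + 1)
    alpha = torch.zeros_like(u)
    alpha[start:boundary + 1] = torch.softmax(u[start:boundary + 1], dim=0)
    return alpha

# e.g. chunkwise_attention(torch.randn(20), boundary=12, w=4)
# attends only to encoder frames 9..12.
```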
“…Although the hybrid CTC/attention end-to-end ASR architecture is reaching reasonable performance [16]-[19], how to deploy it in online scenarios remains an unsolved problem. After inspecting the CTC/attention ASR architecture, we identify four challenges in deploying online hybrid CTC/attention end-to-end ASR systems:…”
Section: Introduction
confidence: 99%
“…As for language modelling, Transformer-based architectures have achieved very promising results [9,10], though Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) are still in broad use [11]. Apart from hybrid systems, end-to-end systems have received great attention in recent years, including a number of proposals for low-latency streaming decoding [12][13][14]. However, despite their simplicity and promising prospects, it is still unclear whether they will soon surpass state-of-the-art hybrid systems that combine independent models trained on vast amounts of data.…”
Section: Introduction
confidence: 99%