Recent Advances in End-to-End Automatic Speech Recognition

Li, Jinyu

doi:10.48550/arxiv.2111.01690

Cited by 14 publications

(11 citation statements)

References 235 publications

(307 reference statements)

“…Jinyu Li [28] gave a detailed overview of E2E models and feasible technologies that makes E2E models to outperform hybrid models in the industry world.…”

Section: B Deep Learning Based Methods For Automatic Speech Recogniti... (mentioning)

confidence: 99%

Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition

2022

Automatic speech recognition (ASR) is one of the most demanding tasks in Natural Language Processing due to its complexity. Recently, deep learning approaches have been deployed for this task and have been shown to outperform traditional machine learning approaches such as ANN. In particular, deep learning methods such as Long Short-Term Memory (LSTM) have achieved improved performance in ASR. However, this method is limited in processing continuous input streams. A traditional LSTM requires 4 linear layers (MLP layers) per cell, which demand large amounts of memory bandwidth at each sequence time-step. An LSTM cannot afford the many computational units required to process continuous input streams because the system does not have enough memory bandwidth to feed them. In this research, an enhanced deep learning LSTM RNN model is proposed to resolve this shortcoming. In the proposed model, a Recurrent Neural Network (RNN) is incorporated as a "forget gate" to the memory block to allow resetting of the cell states at the beginning of sub-sequences. This enables the system to process continuous input streams efficiently without necessarily increasing the required bandwidth. The standard architecture of the LSTM network has also been modified to make effective use of the model parameters, addressing the computational efficiency problems of large networks on large-vocabulary speech recognition. Several CNN-based models and sequential models were also trained on the same dataset, and their performance was compared with that of the proposed model. The LSTM-RNN outperformed the other deep learning models with an accuracy of 99.36% on the well-established public benchmark spoken English digits dataset.

“…ASR [16]. Based on HuBERT encoder, our proposed Speech2C model can also pre-train a Transformer decoder with pseudo label from the clustering model.…”

Section: Related Work (mentioning)

confidence: 99%

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

Zhang¹, Zhang², Liu³ et al. 2022

Preprint

Self Cite

This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language modeling on the encoder output, as in the HuBERT model, while the other lets the decoder learn to reconstruct pseudo codes autoregressively instead of generating textual scripts. In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text. Comprehensive experiments on the LibriSpeech corpus show that the proposed Speech2C achieves a 19.2% relative reduction in word error rate (WER) over the method without decoder pre-training, and also significantly outperforms the state-of-the-art wav2vec 2.0 and HuBERT on fine-tuning subsets of 10h and 100h.

“…With the development of deep learning, end-to-end neural approaches have rapidly gained prominence in the speech recognition community [25]. However, ASR in complicated scenarios such as meetings is still not a solved problem with challenges including complex acoustic conditions, unknown number of speakers and overlapping speech.…”

Section: Related Work (mentioning)

confidence: 99%

Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Yang¹, Zhang², Guo³ et al. 2022

Preprint

The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and most challenging scenarios of speech technologies. The M2MeT challenge sets up two tracks: speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. We briefly describe the released dataset, track setups, and baselines, and summarize the challenge results and major techniques used in the submissions.
