Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

Li, Jinyu; Zhao, Rui; Meng, Ziyang; Liu, Yanqing; Wei, Wenning; Parthasarathy, S.; Mazalov, Vadim; Wang, Zhenghao; He, Lei; Zhao, Shenghui; Gong, Yifan

doi:10.21437/interspeech.2020-3016

Cited by 84 publications

(52 citation statements)

References 36 publications


“…Another approach is to employ TTS to generate synthetic audio based on text data from the target domain. The synthetic audio-text pairs can be used to adapt an E2E model [8,9] to the target domain, or to train a spelling correction model [9,10].…”

Section: Related Work (mentioning)

confidence: 99%

Using Synthetic Audio to Improve the Recognition of Out-of-Vocabulary Words in End-to-End ASR Systems

Zheng, Liu, Gunceler, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Today, many state-of-the-art automatic speech recognition (ASR) systems apply all-neural models that map audio to word sequences trained end-to-end along one global optimisation criterion in a fully data driven fashion. These models allow high precision ASR for domains and words represented in the training material but have difficulties recognising words that are rarely or not at all represented during training, i.e. trending words and new named entities. In this paper, we use a text-to-speech (TTS) engine to provide synthetic audio for out-of-vocabulary (OOV) words. We aim to boost the recognition accuracy of a recurrent neural network transducer (RNN-T) on OOV words by using the extra audio-text pairs, while maintaining the performance on the non-OOV words. Different regularisation techniques are explored and the best performance is achieved by fine-tuning the RNN-T on both original training data and extra synthetic data with elastic weight consolidation (EWC) applied on the encoder. This yields a 57% relative word error rate (WER) reduction on utterances containing OOV words without any degradation on the whole test set.
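The elastic weight consolidation (EWC) regulariser named in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used quadratic form of the penalty, (strength / 2) * sum_i F_i * (theta_i - theta_star_i)^2, with a diagonal Fisher estimate F. The function names, the flat parameter lists, and the example values are hypothetical and are not taken from the paper, which applies the penalty to the RNN-T encoder weights.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    strength: float,
) -> float:
    """Quadratic EWC penalty (assumed form, diagonal Fisher estimate).

    params        -- current weights being fine-tuned
    anchor_params -- weights of the model before fine-tuning
    fisher_diag   -- per-weight importance estimates (non-negative)
    strength      -- hypothetical regularisation coefficient
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all parameter sequences must have the same length")
    return 0.5 * strength * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def regularised_loss(
    task_loss: float,
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    strength: float,
) -> float:
    """Fine-tuning loss plus the EWC term that discourages drift from the anchor."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, strength)
```

Weights with a large Fisher value are pulled strongly back towards their original values, while unimportant weights remain free to adapt to the synthetic data.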


“…The entire system is jointly optimized with the single Transducer objective function, which is a modified CTC loss. Recently, the transducer approach has proven its effectiveness both in large-resource (e.g., [32]) and low-resource (e.g., [33]) tasks.…”

Section: ASR Modeling (mentioning)

confidence: 99%

Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Laptev, Andrusenko, Podluzhny, et al. 2021

Sensors


With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
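The non-deterministic tokenization described in the abstract above can be sketched as follows. This is a minimal illustration of the general BPE-dropout idea (each applicable merge is skipped with some probability at every step), not the authors' implementation. The function name, the toy merge table, and the dropout values are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def bpe_dropout_tokenize(
    word: str,
    merges: Sequence[Tuple[str, str]],
    dropout: float,
    rng: random.Random,
) -> List[str]:
    """Segment `word` with BPE, randomly skipping merges.

    merges  -- merge rules ordered by priority (earlier = higher priority)
    dropout -- probability of skipping each applicable merge at each step;
               0.0 gives ordinary deterministic BPE, 1.0 gives characters
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        # Collect merges that apply to adjacent symbols and survive dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
    rng = random.Random(0)
    for _ in range(3):
        print(bpe_dropout_tokenize("lower", toy_merges, 0.3, rng))
```

Because the segmentation is re-sampled each time an utterance is seen, the same word can appear under several subword splits during training, which is the source of the regularisation effect the abstract refers to.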


“…Great progress has been made to automatic speech recognition (ASR) with end-to-end (E2E) models [1,2,3,4,5,6,7,8,9]. Currently, Transducer (e.g., recurrent neural network Transducer (RNN-T) [10]) and Attention-based Encoder-Decoder (AED) [1,11,12] are two most popular types of E2E methods.…”

Section: Introduction (mentioning)

confidence: 99%

Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset

Xie, Wang, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

129


Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing their application. In this work, we explored the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.
