Interspeech 2020
DOI: 10.21437/interspeech.2020-3016

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

Abstract: Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition. In this paper, we describe our recent development of RNN-T models with reduced GPU memory consumption during training, better initialization strategy, and advanced encoder modeling with future lookahead. When trained with Microsoft's 65 thousand hours of anonymized training data, the developed RNN-T model surpasses a …

Cited by 84 publications (52 citation statements)
References 36 publications

“…Another approach is to employ TTS to generate synthetic audio based on text data from the target domain. The synthetic audio-text pairs can be used to adapt an E2E model [8,9] to the target domain, or to train a spelling correction model [9,10].…”
Section: Related Work (mentioning)
confidence: 99%
“…The entire system is jointly optimized with the single Transducer objective function, which is a modified CTC loss. Recently, the transducer approach has proven its effectiveness both in large-resource (e.g., [32]) and low-resource (e.g., [33]) tasks.…”
Section: ASR Modeling (mentioning)
confidence: 99%
“…Great progress has been made to automatic speech recognition (ASR) with end-to-end (E2E) models [1,2,3,4,5,6,7,8,9]. Currently, Transducer (e.g., recurrent neural network Transducer (RNN-T) [10]) and Attention-based Encoder-Decoder (AED) [1,11,12] are two most popular types of E2E methods.…”
Section: Introduction (mentioning)
confidence: 99%