2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639610
Toward Domain-Invariant Speech Recognition via Large Scale Training

Abstract: Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. This work explores the idea of building a single domain-invariant model for varied use-cases by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech.…

Cited by 94 publications (67 citation statements). References 30 publications.
“…For training, we use the same multidomain datasets as in [20,21] which include anonymized and hand-transcribed English utterances from general Google traffic, far-field environments, telephony conversations, and YouTube. We augment the clean training utterances by artificially corrupting them by using a room simulator, varying degrees of noise, and reverberation such that the signal-to-noise ratio (SNR) is between 0dB and 30dB [23].…”
Section: Datasets
confidence: 99%
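The augmentation described above — mixing noise into clean utterances so the signal-to-noise ratio falls between 0 dB and 30 dB — can be sketched in a few lines of numpy. This is a minimal illustration of SNR-controlled mixing, not the authors' room-simulator pipeline; the function name `mix_at_snr` and the use of white noise as stand-ins for speech and noise are assumptions for the example.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture clean + noise has the requested SNR in dB."""
    # Tile/truncate the noise to match the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain g so that clean_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of stand-in "speech" at 16 kHz
noise = rng.standard_normal(8000)    # shorter noise clip, tiled to fit
snr_db = rng.uniform(0.0, 30.0)      # SNR drawn uniformly from the 0-30 dB range
noisy = mix_at_snr(clean, noise, snr_db)
```

In a real pipeline each training utterance would be corrupted with recorded noise and simulated reverberation rather than white noise, but the SNR bookkeeping is the same.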
“…Our experiments are conducted using the same training data as in [20,21], which is from multiple domains such as Voice Search, YouTube, Farfield and Telephony. We first analyze the behavior of the deliberation model, including performance when attending to multiple RNN-T hypotheses, contribution of different attention, and rescoring vs. beam search.…”
Section: Introduction
confidence: 99%
“…Another important aspect in building high-performance speech recognition systems is the amount and the coverage of the training data. To build high performance speech recognition systems for conversational speech, we need to use a large amount of speech data covering various domains [17]. In [18], it has been shown that we need a very large training set (∼125,000 hours of semi-supervised speech data) to achieve high speech recognition accuracy for difficult tasks like video captioning.…”
Section: Introduction
confidence: 99%
“…In the era of deep neural networks, it has been frequently observed that the amount and coverage of the training data seem to be one of the most important factors to obtain better speech recognition accuracy [12,13]. However, it is very difficult to gather sufficient amount of transcribed data from various domains.…”
Section: Introduction
confidence: 99%