Interspeech 2019
DOI: 10.21437/interspeech.2019-2623

Untranscribed Web Audio for Low Resource Speech Recognition

Abstract: Speech recognition models are highly susceptible to mismatch in the acoustic and language domains between the training and the evaluation data. For low resource languages, it is difficult to obtain transcribed speech for target domains, while untranscribed data can be collected with minimal effort. Recently, a method applying lattice-free maximum mutual information (LF-MMI) to untranscribed data has been found to be effective for semi-supervised training. However, weaker initial models and domain mismatch can …
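For context, the criterion the abstract refers to is the maximum mutual information (MMI) objective; the following is a standard textbook form, with notation assumed here rather than taken from the paper:

\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log \frac{p(\mathbf{X}_u \mid W_u)^{\kappa}\, P(W_u)}{\sum_{W} p(\mathbf{X}_u \mid W)^{\kappa}\, P(W)}

Here X_u is the acoustic feature sequence of utterance u, W_u its reference transcript (a hypothesized transcript in semi-supervised training), and κ an acoustic scale; the denominator sums over competing word sequences. In the lattice-free variant, the denominator is computed over a phone-level language-model graph rather than per-utterance word lattices.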

Cited by 13 publications (18 citation statements)
References 26 publications
“…The original work was demonstrated on in-domain data-sets (conversational speech). Nevertheless, similar techniques have also been employed to use a seed model trained on out-of-domain data followed by SSL on in-domain data [1,17,18]. The latter is often obtained from data crawled from the web.…”
Section: Semi-supervised Training Using LF-MMI
confidence: 99%
“…The acoustic model is then trained with the newly generated labels along with those of the supervised data. In this paper, we describe our efforts for the MATERIAL program, where the training data consists of only conversational speech and the evaluation data consists of three genres: conversational speech, news broadcast and topical broadcast (CS, NB and TB, respectively). Moreover, unlike the training data, the majority of the test data belongs to NB and TB.…”
Section: Introduction
confidence: 99%
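The pseudo-labeling scheme this statement describes can be summarized in a minimal sketch; all names below (Utterance, train, decode) are illustrative stubs, not the authors' code or any toolkit's API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    audio_id: str
    transcript: Optional[str]  # None for untranscribed web audio

def train(data: List[Utterance]) -> object:
    """Stub: train an acoustic model on transcribed utterances."""
    return object()

def decode(model: object, utt: Utterance) -> str:
    """Stub: return the model's 1-best hypothesis for the utterance."""
    return "hypothesized transcript"

# Seed model from the (small) supervised set.
supervised = [Utterance("cs-0001", "reference transcript")]
untranscribed = [Utterance("web-0001", None), Utterance("web-0002", None)]
seed_model = train(supervised)

# Generate labels for the unsupervised data, then retrain on the union
# of the supervised data and the newly pseudo-labeled data.
pseudo_labeled = [Utterance(u.audio_id, decode(seed_model, u))
                  for u in untranscribed]
final_model = train(supervised + pseudo_labeled)
```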
“…This semi-supervised approach to DNN training has been successfully used, e.g., in [9,10,11]. However, the technique requires careful confidence-based data selection, and is very sensitive to the performance of the source system on the target data.…”
Section: Introduction
confidence: 99%
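A minimal sketch of the confidence-based data selection this statement refers to; the tuple format and threshold value are assumptions for illustration, not taken from the cited work:

```python
# Each entry: (utterance id, 1-best hypothesis, decoder confidence in [0, 1]).
hypotheses = [
    ("web-0001", "weather report for monday", 0.94),
    ("web-0002", "unintelligible segment", 0.41),
    ("web-0003", "interview with the minister", 0.87),
]

# Keep only utterances the seed model decoded with high confidence;
# the threshold is a tunable assumption, not a value from the paper.
THRESHOLD = 0.80
selected = [(uid, hyp) for uid, hyp, conf in hypotheses if conf >= THRESHOLD]

print(f"kept {len(selected)} of {len(hypotheses)} utterances")
```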
“…LF-MMI trained AMs have also been effective in transfer-learning-based domain adaptation, wherein an ASR model trained on a large out-of-domain dataset is adapted in a supervised [7] or semi-supervised way [8] to a smaller in-domain dataset. Semi-supervised LF-MMI based training has also been used for low-resource languages [9]. Yet, these works have typically relied on hundreds of hours of in-domain data.…”
Section: Introduction
confidence: 99%
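The transfer-learning adaptation pattern mentioned in this statement, as a minimal sketch; the function name, learning rates, and data shapes are illustrative placeholders, not any toolkit's API:

```python
def train_model(data, init=None, lr=1e-3):
    """Stub: train (or continue training) an acoustic model; when `init`
    is given, parameters start from the previously trained model."""
    return {"initialized_from": init, "lr": lr}

out_of_domain = [("ood-0001", "transcript")] * 3   # large out-of-domain corpus
in_domain = [("id-0001", "transcript")]            # small in-domain set

# 1) Train on the large out-of-domain corpus.
base = train_model(out_of_domain)

# 2) Adapt: continue training on in-domain data from the base model,
#    typically with a reduced learning rate (value here is an assumption).
adapted = train_model(in_domain, init=base, lr=1e-4)
```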