2020
DOI: 10.48550/arxiv.2010.10504
Preprint

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

Abstract: We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech, utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained with wav2vec 2.0. By doing so, we are able to achieve word-error-rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets against the current state-of-the-art WERs of 1.7%/3…

Cited by 84 publications (136 citation statements)
References 47 publications
“…By replacing the underlying long short-term memory (LSTM) [25] with a Transformer [26] in the encoder, which allows a more powerful attention mechanism to be used, CTC has thrived again in recent studies [27]. It is further boosted by emerging self-supervised learning technologies [28][29][30][31], which can learn very good representations that carry semantic information.…”
Section: A) Connectionist Temporal Classification
confidence: 99%
“…SSL is even more powerful because it does not need any labeled data for pre-training, naturally addressing the low-resource challenge. Therefore, SSL is becoming a new trend that works especially well for ASR on resource-limited languages [28][29][30][31][278][279][280][281], with representative technologies such as wav2vec 2.0 [28], autoregressive predictive coding [279], and HuBERT [31]. While most SSL studies focus on very limited supervised training data (e.g., 1000 hours), recent studies have also shown promising results on industry-scale supervised training data of tens of thousands of hours [282,283].…”
Section: Miscellaneous Topics
confidence: 99%
“…A 512-point FFT is used to extract the 257-dimensional LPS. Six pairs of microphones are selected for IPD and TPD computation: (0,7), (1,6), (2,5), (3,4), (4,7), (3,4). The total dimension of the input feature after concatenation is 257 × (1 + 6 + 1) = 2056.…”
Section: Separation Module
confidence: 99%
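The dimension arithmetic in the quote above can be sanity-checked with a short sketch: one 257-dimensional log power spectrum (LPS), one IPD block per microphone pair, and one 257-dimensional TPD block are concatenated per frame. All variable names and array shapes here are illustrative assumptions, not the cited paper's actual code.

```python
import numpy as np

n_fft = 512
n_freq = n_fft // 2 + 1  # 257 frequency bins from a 512-point FFT
mic_pairs = [(0, 7), (1, 6), (2, 5), (3, 4), (4, 7), (3, 4)]  # 6 pairs, as quoted
n_frames = 100  # arbitrary number of STFT frames for illustration

# Placeholder feature blocks (real IPD/TPD would be derived from multi-channel STFTs)
lps = np.random.rand(n_frames, n_freq)                    # 1 x 257 dims
ipd = np.random.rand(n_frames, len(mic_pairs) * n_freq)   # 6 x 257 dims
tpd = np.random.rand(n_frames, n_freq)                    # 1 x 257 dims

# Frame-wise concatenation: 257 * (1 + 6 + 1) = 2056 dimensions
features = np.concatenate([lps, ipd, tpd], axis=1)
print(features.shape)  # (100, 2056)
```

This matches the quoted total of 2056 input dimensions per frame.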
“…With the development of speech techniques and deep neural networks, dramatic improvements have been achieved on multiple automatic speech recognition (ASR) benchmarks [1,2,3,4]. However, multi-channel multi-speaker overlapped speech recognition remains a challenging task due to interfering speakers and background noise [5,6,7].…”
Section: Introduction
confidence: 99%
“…Recent developments in automatic speech recognition (ASR) for spoken languages [13,14,65,70] … Text-based sign language video retrieval: In this work we introduce sign language video retrieval with free-form textual queries, the task of searching collections of sign language videos to find the best match for a free-form textual query, going beyond single-keyword search.…”
Section: Introduction
confidence: 99%