Interspeech 2020
DOI: 10.21437/interspeech.2020-1337

Semi-Supervised Learning with Data Augmentation for End-to-End ASR

Abstract: In this paper, we apply Semi-Supervised Learning (SSL) along with Data Augmentation (DA) for improving the accuracy of End-to-End ASR. We focus on the consistency regularization principle, which has been successfully applied to image classification tasks, and present sequence-to-sequence (seq2seq) versions of the FixMatch and Noisy Student algorithms. Specifically, we generate the pseudo labels for the unlabeled data on-the-fly with a seq2seq model after perturbing the input features with DA. We also propose so…
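The core idea sketched in the abstract, generating pseudo-labels on the fly with a seq2seq model and training on a data-augmented view of the same unlabeled utterance, can be illustrated roughly as follows. This is a minimal sketch, not the paper's recipe: `model.decode_greedy` and `model.seq2seq_loss` are hypothetical placeholders, the frequency-masking policy in `freq_mask` is an assumed SpecAugment-style perturbation, and whether the labeling pass sees the clean or a weakly perturbed input is a design choice left open here.

```python
# Minimal sketch of on-the-fly pseudo-labeling with data augmentation
# (FixMatch-style consistency training for seq2seq ASR).
# `model.decode_greedy` and `model.seq2seq_loss` are hypothetical placeholders.
import torch


def freq_mask(feats, num_masks=2, max_width=15):
    """Zero out random frequency bands of a (time, n_mels) log-Mel tensor."""
    feats = feats.clone()
    n_mels = feats.size(1)
    for _ in range(num_masks):
        width = int(torch.randint(0, max_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_mels - width), (1,)))
        feats[:, start:start + width] = 0.0
    return feats


def consistency_step(model, unlabeled_feats):
    """One unsupervised step: pseudo-label the input on the fly, then compute
    the seq2seq loss on an augmented view of the same utterance."""
    with torch.no_grad():
        pseudo_tokens = model.decode_greedy(unlabeled_feats)   # on-the-fly labels
    augmented = freq_mask(unlabeled_feats)                     # perturbed view
    return model.seq2seq_loss(augmented, pseudo_tokens)
```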

Cited by 23 publications (14 citation statements) | References 35 publications
“…We focus on self-training [21] or pseudo-labeling (PL) [22], which has recently been adopted for semi-supervised E2E ASR and shown to be effective [23][24][25][26][27][28][29][30][31][32]. In PL, a teacher (base) model is first trained on labeled data and used to generate pseudo-labels for unlabeled data.…”
Section: Introduction
mentioning, confidence: 99%
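As a rough illustration of the pseudo-labeling recipe described in this excerpt, the basic teacher-student loop can be written as below. The `train_fn` and `decode_fn` callables are assumptions made for the sake of the sketch, not any specific toolkit's API.

```python
# Sketch of the basic teacher-student pseudo-labeling loop: train a teacher on
# labeled data, label the unlabeled pool with it, then train a student on both.
# `train_fn` and `decode_fn` are hypothetical helpers supplied by the caller.

def pseudo_labeling(labeled, unlabeled, train_fn, decode_fn):
    teacher = train_fn(labeled)                                # base model
    pseudo = [(x, decode_fn(teacher, x)) for x in unlabeled]   # pseudo-labels
    student = train_fn(labeled + pseudo)                       # retrain on union
    return student
```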
“…[15,24]) and an encoder-decoder structure with attention (cf. [2,33]). The far-field ASR task is treated as a sequence-to-sequence learning problem: The model M is trained to predict a sequence of symbols y_j (here, we use sub-word units) from the multi-channel complex spectrum X ∈ ℂ^{T×F×C}, where T is the number of frames, F is the number of frequency bins, and C is the number of channels in an input utterance.…”
Section: End-to-End Multi-Channel ASR
mentioning, confidence: 99%
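For concreteness, a multi-channel complex spectrum X ∈ ℂ^{T×F×C} of the kind referenced in this excerpt can be assembled per utterance roughly as follows; the STFT parameters below are illustrative assumptions, not the cited system's front-end settings.

```python
# Sketch: build a complex spectrum of shape (T, F, C) from a C-channel waveform.
# n_fft and hop_length are illustrative values only.
import torch


def multichannel_spectrum(waveform, n_fft=512, hop_length=128):
    """waveform: (C, num_samples) float tensor -> complex tensor (T, F, C)."""
    window = torch.hann_window(n_fft)
    specs = []
    for channel in waveform:                                   # STFT per channel
        s = torch.stft(channel, n_fft=n_fft, hop_length=hop_length,
                       window=window, return_complex=True)     # (F, T)
        specs.append(s.transpose(0, 1))                         # (T, F)
    return torch.stack(specs, dim=-1)                           # (T, F, C)
```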
“…The decoder is composed of 2 LSTM layers of size 1024, and the dropout rates are set to 0.1 and 0.4 for the first and second layers, respectively. The training recipe is similar to [33]. SA is applied with F_max = 15, m_F = 2 in the ASR feature domain (80-dimensional log-Mel features).…”
Section: Test Data
mentioning, confidence: 99%
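The decoder configuration quoted above (two LSTM layers of size 1024 with dropout 0.1 and 0.4) might look roughly like the following in PyTorch. The embedding and output layers, and the omission of attention over encoder states, are simplifications of mine rather than part of the cited recipe.

```python
# Rough sketch of a 2-layer LSTM decoder with per-layer dropout rates 0.1 and 0.4,
# matching the sizes quoted above. Attention over encoder states is omitted here.
import torch.nn as nn


class DecoderSketch(nn.Module):
    def __init__(self, vocab_size, hidden_size=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm1 = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.drop1 = nn.Dropout(0.1)          # dropout after the first LSTM layer
        self.lstm2 = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.drop2 = nn.Dropout(0.4)          # dropout after the second LSTM layer
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)                # (batch, length, hidden)
        x, _ = self.lstm1(x)
        x = self.drop1(x)
        x, _ = self.lstm2(x)
        x = self.drop2(x)
        return self.output(x)                 # logits over sub-word units
```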
“…A student model is then trained on the augmented training data, including both labeled and pseudo-parallel data, to obtain a model that is expected to generalize better to the target domain. ST has recently shown excellent performance for neural sequence generation tasks such as machine translation [15] and ASR [16][17][18], achieving state-of-the-art performance for semi-supervised ASR when applied in an iterative manner [19]. Classical works in ST [20][21][22] suggest that its performance is not stable if the generated pseudo-labels are highly erroneous, and hence ST is often accompanied by a filtering process to remove such pseudo-labeled utterances from the training data.…”
Section: Introduction
mentioning, confidence: 99%
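The filtering step mentioned at the end of this excerpt, dropping pseudo-labeled utterances that are likely erroneous, can be sketched as a simple threshold on a confidence score. The scoring function and threshold below are illustrative assumptions rather than the criteria used in the cited works.

```python
# Sketch of confidence-based filtering of pseudo-labeled data: utterances whose
# hypothesis scores below a threshold are removed before student training.
# `score_fn` (e.g. a length-normalized log-probability) is supplied by the caller.

def filter_pseudo_labels(pseudo_data, score_fn, threshold):
    """pseudo_data: iterable of (features, hypothesis) pairs."""
    return [(x, y) for x, y in pseudo_data if score_fn(x, y) >= threshold]
```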