ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9052940
Sequence-Level Consistency Training for Semi-Supervised End-to-End Automatic Speech Recognition

Cited by 16 publications (13 citation statements)
References 15 publications
“…where L_unsup is maximized via a gradient descent optimization. Note that in (4), we apply the data augmentation to an unlabeled input as in [24,26], aiming for the online model to learn robust prediction of pseudo-labels from the noisy input. In Sec.…”
Section: Semi-Supervised Training With MPL
confidence: 99%
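The quoted passage describes the core consistency idea: a pseudo-label is produced from the clean unlabeled utterance, while the online model is trained to predict it from an augmented view of the same utterance. The PyTorch-style sketch below illustrates one way this can look; the helper names (offline_model.decode, spec_augment) and the CTC-style loss are assumptions for illustration, not the cited papers' exact formulation.

import torch
import torch.nn.functional as F

def unsupervised_consistency_loss(online_model, offline_model, x_unlabeled, spec_augment, blank_id=0):
    # Teacher/offline model decodes the clean unlabeled utterance into a pseudo-label
    # (e.g. greedy CTC decoding); no gradients flow through this step.
    with torch.no_grad():
        pseudo_label = offline_model.decode(x_unlabeled)   # LongTensor of token ids, shape (U,)

    # Augment the same utterance (e.g. SpecAugment-style time/frequency masking).
    x_noisy = spec_augment(x_unlabeled)

    # Online/student model predicts from the noisy view.
    log_probs = online_model(x_noisy)                      # (T, 1, vocab), log-softmax outputs

    input_lengths = torch.tensor([log_probs.size(0)])
    target_lengths = torch.tensor([pseudo_label.numel()])
    return F.ctc_loss(log_probs, pseudo_label.unsqueeze(0), input_lengths, target_lengths, blank=blank_id)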
“…We focus on self-training [21] or pseudo-labeling (PL) [22], which has recently been adopted for semi-supervised E2E ASR and shown to be effective [23][24][25][26][27][28][29][30][31][32]. In PL, a teacher (base) model is first trained on labeled data and used to generate pseudo-labels for unlabeled data.…”
Section: Introduction
confidence: 99%
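As a rough illustration of the pseudo-labeling (PL) pipeline summarized in this quote, the following sketch uses hypothetical helpers train_supervised and decode; it is a schematic outline under those assumptions, not the cited papers' implementation.

def pseudo_label_training(train_supervised, decode, labeled_set, unlabeled_set):
    # 1. Train the teacher (base) model on the labeled data only.
    teacher = train_supervised(labeled_set)

    # 2. Let the teacher transcribe the unlabeled utterances into pseudo-labels.
    pseudo_labeled = [(x, decode(teacher, x)) for x in unlabeled_set]

    # 3. Train the student on the union of labeled and pseudo-labeled data.
    student = train_supervised(labeled_set + pseudo_labeled)
    return student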
“…Following the baseline system [9], the encoder g_enc consists of five 1-D convolutional layers with kernel sizes of (10, 8, 4, 4, 4) and stride sizes of (5, 4, 2, 2, 2). The downsampling factor of g_enc is 160 and the embedding z has a sampling rate of 100 Hz.…”
Section: Methods
confidence: 99%
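The encoder described in this quote can be sketched directly from the stated hyperparameters. The snippet below is a PyTorch approximation: the kernel and stride sizes come from the quote, while the channel width (512) and ReLU activations are assumptions; with 16 kHz waveform input, the cumulative stride of 5*4*2*2*2 = 160 yields roughly 100 embedding frames per second.

import torch
import torch.nn as nn

class ConvFeatureEncoder(nn.Module):
    # Five 1-D convolutions with kernels (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2),
    # giving a total downsampling factor of 160 as stated in the quote.
    def __init__(self, dim=512):
        super().__init__()
        layers, in_ch = [], 1
        for k, s in zip((10, 8, 4, 4, 4), (5, 4, 2, 2, 2)):
            layers += [nn.Conv1d(in_ch, dim, kernel_size=k, stride=s), nn.ReLU()]
            in_ch = dim
        self.net = nn.Sequential(*layers)

    def forward(self, wav):                  # wav: (batch, samples), raw 16 kHz audio
        z = self.net(wav.unsqueeze(1))       # (batch, dim, frames)
        return z.transpose(1, 2)             # (batch, frames, dim), ~100 frames per second

# One second of 16 kHz audio produces on the order of 100 frames.
frames = ConvFeatureEncoder()(torch.randn(2, 16000)).shape[1]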
“…Many studies have shown that textual information is essential for building speech recognition systems and language models (LM). Recently, several important studies on representation learning [1,2,3,4,5] and semi-supervised training [6,7,8] explored using a large amount of speech data without corresponding text annotations and demonstrated significant improvements in speech recognition performance. This suggests that such systems may learn to train their own LM from raw audio only.…”
Section: Introduction
confidence: 99%
“…A recent work [13] on semi-supervised sequence-to-sequence ASR has applied consistency training and has shown effectiveness with unlabeled speech data. Our previous work called ASR-TTS [4] used cycle-consistency training with REINFORCE and showed gains on standard speech datasets.…”
Section: Introduction
confidence: 99%
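The cycle-consistency idea mentioned in the last quote (ASR-TTS with REINFORCE) can be sketched as a policy-gradient update in which sampled ASR hypotheses are rewarded by how well a TTS model reconstructs the original speech. The snippet below is a hedged toy outline with hypothetical callables (sample_hyps, tts_recon_error), not the method of [4].

import torch

def reinforce_cycle_loss(sample_hyps, tts_recon_error, speech):
    # sample_hyps(speech) -> (hypotheses, log_probs); both are hypothetical helpers.
    hyps, log_probs = sample_hyps(speech)
    with torch.no_grad():
        reward = -tts_recon_error(hyps, speech)   # better reconstruction => higher reward
        baseline = reward.mean()                  # simple baseline for variance reduction
    # REINFORCE: increase the log-probability of hypotheses with above-baseline reward.
    return -((reward - baseline) * log_probs).mean()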