ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746719
Large-Scale ASR Domain Adaptation Using Self- and Semi-Supervised Learning

Cited by 22 publications (8 citation statements)
References: 12 publications
“…| Work | Method | Model | Adaptation Setting | Language |
|------|--------|-------|--------------------|----------|
| [23], [25], [26] | Teacher-student (hard and soft labels) | Conformer RNN-T [27], Transformer CTC RNN-T [19] | News speech, voice search, far-field, telephony, YouTube | English |
| [4], [5] | Teacher-student (soft labels) | TDNN-LSTM [28] | Noise, far-field | English |
| [29] | Teacher-student (hard and soft labels) | NiN-CNN [30] | Dialects, children's speech | Japanese |
| [31] | Teacher-student (soft labels) | Streaming RNN-T [32] | Multilingual | English, Brazilian Portuguese, Russian, Turkish, Nordic/Germanic |
| [6], [33], [34] | Domain adversarial training | TDNN (Kaldi) [35], [36], DNN-HMM | Noise, channel | English |
| [37] | Domain adversarial training | RNN-CTC [38] | Far-field | English |
| [8], [39 … (row truncated in the excerpt) | | | | |

1) Inspired by recent advances in UDA for Natural Language Processing systems [45], we propose a fine-tuning strategy for speech models where the self-supervised objective is based on a contrastive loss (Section III). Contrary to prior works, which leverage only in-domain self-supervision, we find that in this contrastive setting this leads to mode collapse of the latent representations, and that mixed source- and target-domain self-supervision is essential.…”
Section: Work
Mentioning, confidence: 99%
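The excerpt above argues that a contrastive self-supervised fine-tuning objective collapses if it only sees in-domain (target) audio, so batches should mix source- and target-domain utterances. As a minimal sketch of that idea (not the cited paper's implementation; the InfoNCE-style loss, the `encoder` interface, and the temperature value are assumptions), the loss can be computed over latent frames drawn from a concatenated mixed-domain batch:

```python
# Illustrative sketch: contrastive (InfoNCE-style) self-supervised loss on a
# batch that mixes source- and target-domain features. The encoder interface
# and the temperature are assumptions, not the cited paper's code.
import torch
import torch.nn.functional as F

def contrastive_loss(context, masked_targets, temperature=0.1):
    """Each context frame should match its own masked target against all
    other targets in the (mixed-domain) batch, which act as negatives."""
    context = F.normalize(context, dim=-1)                 # (N, D)
    masked_targets = F.normalize(masked_targets, dim=-1)   # (N, D)
    logits = context @ masked_targets.t() / temperature    # (N, N) similarities
    labels = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, labels)

def mixed_domain_ssl_step(encoder, source_feats, target_feats):
    """Concatenate source- and target-domain features so negatives span both
    domains (hypothetical encoder returning (context, masked_target) pairs)."""
    feats = torch.cat([source_feats, target_feats], dim=0)
    context, masked_targets = encoder(feats)
    return contrastive_loss(context, masked_targets)
```

In this sketch the negatives are drawn from both domains, which, per the excerpt, is what purely in-domain self-supervision lacks when the representations collapse onto a single mode.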
“…The agreement between model predictions with and without dropout is used for confidence scoring. In [23], a multi-task training objective with a confidence loss is applied to minimise the binary cross entropy between the estimated confidence and the binary target sequence. In order to learn more robust and generalizable features from the teacher model, Noisy Student Training (NST) has been proposed in [52].…”
Section: B. Teacher-Student Models
Mentioning, confidence: 99%
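The confidence loss mentioned for [23] is a binary cross-entropy between an estimated per-token confidence and a binary correctness target. A minimal sketch, assuming element-wise correctness targets and an illustrative loss weight (the actual formulation in [23] may build the targets differently):

```python
# Illustrative sketch of a multi-task objective with a confidence loss:
# BCE between estimated confidences and a binary correctness target sequence.
import torch.nn.functional as F

def confidence_loss(confidence_logits, hyp_tokens, ref_tokens):
    # 1 where the hypothesised token matches the reference, 0 otherwise.
    # (Element-wise comparison is a simplification; an edit-distance
    # alignment would normally produce these targets.)
    binary_targets = (hyp_tokens == ref_tokens).float()
    return F.binary_cross_entropy_with_logits(confidence_logits, binary_targets)

def multitask_loss(asr_loss, confidence_logits, hyp_tokens, ref_tokens, weight=0.1):
    # Total objective = ASR loss + weighted confidence loss (weight is illustrative).
    return asr_loss + weight * confidence_loss(confidence_logits, hyp_tokens, ref_tokens)
```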
“…We assume that G is much larger than T. Confidence filtering is a classic method in data selection for ASR [28,29], and has been successfully applied to end-to-end models [12,30,31]. However, confidence methods focus on data of good transcript quality, which might not be useful for self-supervised pre-training.…”
Section: Data Selection for Self-Supervised Learning
Mentioning, confidence: 99%
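Confidence filtering as described in the excerpt keeps only utterances whose recogniser confidence clears a threshold. A minimal sketch, with an assumed utterance record and a hypothetical 0.9 threshold:

```python
# Illustrative confidence filtering for ASR data selection: keep utterances
# whose (e.g. teacher-model) confidence passes a threshold, dropping
# pseudo-labels of poor transcript quality. The record fields and the
# threshold value are assumptions for illustration.
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    audio_path: str
    pseudo_transcript: str
    confidence: float  # e.g. average per-token posterior from the recogniser

def confidence_filter(utterances: List[Utterance], threshold: float = 0.9) -> List[Utterance]:
    return [u for u in utterances if u.confidence >= threshold]
```

As the excerpt notes, this biases selection toward audio the model already transcribes well, which is why it may be a poor criterion when choosing data for self-supervised pre-training.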
“…A foundation model [1] is usually a large model trained on broad data (generally using self-supervision at scale) that can be fine-tuned to a wide range of downstream tasks; it has attracted extensive attention due to its impressive quality improvements and emergent capabilities [2,3,4,5]. In the speech community, self-supervised pretraining of speech foundation models on large amounts of unsupervised speech has shown impressive quality improvements on various speech recognition tasks [6,7]. There are two main categories of speech self-supervised learning algorithms.…”
Section: Introduction
Mentioning, confidence: 99%