ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414310

On Scaling Contrastive Representations for Low-Resource Speech Recognition

Abstract: Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2…
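The core idea in the abstract is to freeze the pre-trained contrastive encoder and train a separate recognizer on its fixed representations, avoiding fine-tuning in the large parameter space. The following is a minimal sketch of extracting such fixed features; it assumes the publicly available wav2vec 2.0 checkpoints exposed through the Hugging Face transformers library and an illustrative checkpoint name, not the authors' exact pipeline.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load a pre-trained (not fine-tuned) wav2vec 2.0 encoder; checkpoint name is illustrative.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()  # the encoder stays frozen; only a downstream recognizer would be trained

# One second of 16 kHz audio as a placeholder for a real utterance.
waveform = np.zeros(16000, dtype=np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # (batch, frames, hidden) contextual representations, used as fixed input features.
    features = encoder(inputs.input_values).last_hidden_state

print(features.shape)
```

In this setup only the downstream recognizer's parameters are optimized, which is what makes the approach cheaper than end-to-end fine-tuning of the full pre-trained model.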

Cited by 3 publications (1 citation statement)
References 16 publications
“…During fine-tuning, we only update the parameters of the Transformer based context network following the wav2vec2.0 [5]. Fine-tuning wav2vec2.0 on labeled data with CTC objective [16] has been well verified [30,31,32]. However, according to the work [6], the results of fine-tuning wav2vec2.0 based on a vanilla Transformer S2S ASR model with cross-entropy criterion can only achieve a very limited result.…”
Section: Encoder (W2v-encoder)
Mentioning, confidence: 99%
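The quoted statement contrasts fine-tuning wav2vec 2.0 with a CTC objective against fine-tuning it under a sequence-to-sequence cross-entropy criterion. Below is a minimal, hypothetical sketch of a CTC head on top of encoder outputs in PyTorch; the feature dimension, vocabulary size, and random tensors are placeholders, not the cited papers' configurations.

```python
import torch
import torch.nn as nn

feat_dim = 1024   # assumed encoder output size (e.g. wav2vec 2.0 LARGE)
vocab_size = 32   # placeholder: characters plus the CTC blank

head = nn.Linear(feat_dim, vocab_size)           # linear CTC head over encoder frames
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(4, 200, feat_dim)         # (batch, frames, dim) stand-in encoder outputs
targets = torch.randint(1, vocab_size, (4, 30))  # dummy label sequences (no blanks)
input_lengths = torch.full((4,), 200)
target_lengths = torch.full((4,), 30)

# CTCLoss expects log-probabilities shaped (frames, batch, vocab).
log_probs = head(features).log_softmax(-1).transpose(0, 1)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Whether the encoder parameters are updated alongside this head (fine-tuning) or kept frozen (the fixed-representation setting studied in the paper) is the main axis of comparison.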