ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414539
A Further Study of Unsupervised Pretraining for Transformer Based Speech Recognition

Abstract: The construction of an effective speech recognition system typically requires large amounts of transcribed data, which is expensive to collect. To overcome this problem, many unsupervised pretraining methods have been proposed. Among these methods, Masked Predictive Coding (MPC) achieved significant improvements on various speech recognition datasets with a BERT-like masked reconstruction loss and a Transformer backbone. However, many aspects of MPC have yet to be fully investigated. In this paper, we conduct a further…
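The BERT-like masked reconstruction objective mentioned in the abstract can be illustrated with a minimal sketch. The masking ratio, the zero-fill corruption, and the L1 loss on log-FBANK frames below are illustrative assumptions, not the paper's exact MPC configuration.

```python
import torch
import torch.nn as nn

class MPCPretrainer(nn.Module):
    """Minimal sketch of BERT-like Masked Predictive Coding for speech.

    A Transformer encoder reconstructs masked input frames; the loss is
    computed only over masked positions (L1 is an illustrative choice,
    as are all hyperparameters below).
    """

    def __init__(self, feat_dim=80, d_model=512, nhead=8, num_layers=12):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, feats, mask_ratio=0.15):
        # feats: (batch, time, feat_dim) log-FBANK features
        mask = torch.rand(feats.shape[:2], device=feats.device) < mask_ratio
        corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked frames
        hidden = self.encoder(self.in_proj(corrupted))
        recon = self.out_proj(hidden)
        # Reconstruction loss only on the masked frames
        return nn.functional.l1_loss(recon[mask], feats[mask])
```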

Cited by 26 publications (33 citation statements) | References 42 publications
“…Unfortunately, the highly correlated nature of neighbouring samples in EEG (or most other continuous data for that matter), is not conducive to this approach. The likely result would be that, instead of an EM, a method for interpolation would be learned, as has been argued in similar work in self-supervised learning with speech [41]. In other words, the smoothness of these data would make it hard to produce general features simply through recovering missing points.…”
Section: Pre-training With DNNs
confidence: 99%
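The interpolation argument made in this citation can be checked with a small, hypothetical numeric example: on a smooth signal, a masked sample is recovered almost exactly by averaging its neighbours, so a reconstruction objective alone exerts little pressure to learn higher-level features.

```python
import numpy as np

# A smooth, band-limited signal stands in for EEG (or FBANK frames).
t = np.linspace(0.0, 1.0, 1000)
signal = np.sin(2 * np.pi * 5 * t)

# "Mask" a single sample and reconstruct it by linearly interpolating
# its immediate neighbours.
i = 500
interp = 0.5 * (signal[i - 1] + signal[i + 1])
error = abs(interp - signal[i])

print(f"true={signal[i]:.6f}  interp={interp:.6f}  error={error:.2e}")
# The error is tiny, which is why naive masked reconstruction of highly
# correlated samples can degenerate into learned interpolation.
```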
“…Prior work in self-supervised speech recognition has begun to synthesize parts of CPC and MLM to produce methodologies for self-learning with raw waveforms [13,44,45,41,31]. In our work, we adapt one of these approaches called wav2vec 2.0 [13] (its particular formulation is detailed in section 2.4.1) to EEG, and investigate how effective the representations (BENDR) are for downstream tasks.…”
Section: Pre-training With DNNs
confidence: 99%
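The CPC/MLM synthesis referred to above can be sketched as a contrastive loss evaluated only at masked timesteps, roughly in the style of wav2vec 2.0. This is a simplified stand-in: the published objective uses quantized targets, structured negative sampling and an auxiliary diversity loss, none of which are reproduced here.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(context, targets, mask, num_negatives=10, temperature=0.1):
    """Simplified sketch of a masked contrastive (CPC-style) objective.

    context: (B, T, D) Transformer outputs
    targets: (B, T, D) latent targets (quantized in wav2vec 2.0; plain latents here)
    mask:    (B, T) boolean, True where the input was masked
    """
    c = context[mask]    # (N, D) predictions at masked positions
    pos = targets[mask]  # (N, D) corresponding true latents
    n = c.size(0)

    # Sample distractors from other masked timesteps (simplified negative sampling).
    neg_idx = torch.randint(0, n, (n, num_negatives), device=c.device)
    candidates = torch.cat([pos.unsqueeze(1), pos[neg_idx]], dim=1)  # (N, 1+K, D)

    logits = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(n, dtype=torch.long, device=c.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```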
“…For ASR, several open-source read speech corpora, such as HKUST and AISHELL-1, have also been released. The scale of these databases is significantly smaller than the ASR datasets used in companies [24,25], which may lead to a divergence between research and industry.…”
Section: Related Work
confidence: 99%
“…We use characters as labels to train and evaluate all ASR models. Recently, Transformer-based models have achieved competitive performance on HKUST, AISHELL-1 and LibriSpeech [24,25,31]. In the ASR experiments, we follow the model structure of previous work [25] with e = 12, d = 6, d_model = 512, d_ff = 1280 and d_head = 8.…”
Section: ASR
confidence: 99%
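The hyperparameters quoted in this citation map directly onto a standard Transformer encoder-decoder. The sketch below assumes e and d count encoder and decoder layers, that d_head denotes the number of attention heads, and uses a hypothetical character vocabulary size.

```python
import torch.nn as nn

# Hyperparameters as quoted in the citation; VOCAB_SIZE is hypothetical.
NUM_ENCODER_LAYERS = 12   # e
NUM_DECODER_LAYERS = 6    # d
D_MODEL = 512             # d_model
D_FF = 1280               # d_ff
NUM_HEADS = 8             # assuming d_head is the number of attention heads
VOCAB_SIZE = 4000         # hypothetical character vocabulary

asr_transformer = nn.Transformer(
    d_model=D_MODEL,
    nhead=NUM_HEADS,
    num_encoder_layers=NUM_ENCODER_LAYERS,
    num_decoder_layers=NUM_DECODER_LAYERS,
    dim_feedforward=D_FF,
    batch_first=True,
)
# A character classifier head applied to decoder outputs.
output_head = nn.Linear(D_MODEL, VOCAB_SIZE)
```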
“…Self-supervised learning has been shown effective in natural language processing (NLP), e.g., BERT [1] and BART [2], where it makes use of a large amount of unlabeled data for pretraining to improve the performance of downstream tasks, such as automatic speech recognition (ASR) [3,4,5,6,7,8].…”
Section: Introduction
confidence: 99%