Interspeech 2020
DOI: 10.21437/interspeech.2020-1511

Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks

Abstract: Self-attention networks (SANs) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme, "Speech-XLNet", for unsupervised acoustic model pretraining to learn speech representations with SANs. The pretrained SAN is finetuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the speech frame orders, the permutation in Speech-XLNet serves as a strong regularizer…
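To make the abstract's core idea concrete, below is a minimal sketch (not the authors' implementation) of the frame-order shuffling step; the feature shape (200 frames of 80-dim log-mel) and the helper name permute_frames are illustrative assumptions. In an XLNet-style objective, the sampled permutation defines the autoregressive factorization order: each frame is predicted from the frames that precede it in the permuted order, which is what is conjectured to act as a strong regularizer.

import numpy as np

def permute_frames(features, rng):
    """Shuffle the time order of a (T, D) acoustic feature matrix.

    Returns the permuted features together with the permutation, which
    an XLNet-style objective uses as its autoregressive factorization
    order (predict each frame from those preceding it in this order).
    """
    T = features.shape[0]
    perm = rng.permutation(T)
    return features[perm], perm

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 80))  # e.g. 200 frames of 80-dim log-mel (illustrative)
permuted, order = permute_frames(feats, rng)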

Cited by 51 publications (37 citation statements)
References 10 publications
“…Self-supervised pre-training for speech: In speech, wav2vec (Schneider et al., 2019) leverages contrastive learning to produce contextual representations for audio input; vq-wav2vec (Baevski et al., 2020a) and wav2vec 2.0 (Baevski et al., 2020b) further propose to discretize the original continuous audio signals in order to enable more efficient MLM training with the Transformer (Vaswani et al., 2017). Pre-trained speech models have been applied to ASR (Baevski et al., 2020b), phoneme recognition (Song et al., 2020; Liu et al., 2020a), speech translation (Nguyen et al., 2020; Chung et al., 2019c), and speech synthesis (Chung et al., 2019b), to name a few.…”
Section: Related Work (mentioning)
confidence: 99%
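For readers unfamiliar with the contrastive objective this statement refers to, the sketch below shows an InfoNCE-style loss in the spirit of wav2vec: a context representation is scored against its true latent and against sampled distractors. The function info_nce, the cosine scoring, and the temperature value are illustrative assumptions, not wav2vec's actual implementation.

import numpy as np

def info_nce(context, target, negatives, temperature=0.1):
    """Contrastive (InfoNCE-style) loss for one context/target pair.

    context:   (D,) context representation
    target:    (D,) true ("positive") latent for that context
    negatives: (K, D) distractor latents from other positions
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    pos = cos(context, target) / temperature
    negs = np.array([cos(context, n) for n in negatives]) / temperature
    logits = np.concatenate(([pos], negs))
    # Softmax cross-entropy with the positive as the correct class.
    return float(-pos + np.log(np.exp(logits).sum()))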
“…More recent work has explored incorporating contextual information in the pre-training stage, modeling each frame in the context of the entire input sequence. The pre-training objectives, usually based on self-supervised learning, include next-step prediction [7,8], masked acoustic modeling [9,10,11], and connectionist temporal classification [12]. Pre-trained contextualized acoustic representations appear to be extremely effective.…”
Section: Introduction (mentioning)
confidence: 99%
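As a rough illustration of the masked-acoustic-modeling objective cited above [9,10,11], the sketch below masks a random subset of frames so a model can be trained to reconstruct them, analogous to masked language modeling. The mask probability and the zero-fill strategy are assumptions for illustration; the cited systems differ in such details.

import numpy as np

def mask_frames(features, mask_prob=0.15, rng=None):
    """Zero out a random subset of frames of a (T, D) feature matrix.

    Returns the masked input and the boolean mask; a model would be
    trained to reconstruct the original frames at the masked positions
    (e.g. with an L1 reconstruction loss).
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(features.shape[0]) < mask_prob
    masked = features.copy()
    masked[mask] = 0.0
    return masked, mask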
“…References [10,11] proposed an Autoregressive Predictive Coding (APC) objective that predicts unseen future frames from past frames, which yielded satisfactory results in phonetic classification, speech recognition, and speech translation. Other studies [12,13,14,15,16,17] were motivated by NLP and applied similar methods to speech tasks. Among these methods, Masked Predictive Coding (MPC) [15] achieved significant improvements on state-of-the-art transformer-based speech recognition models on various datasets without introducing any additional parameters into the speech recognition model.…”
Section: Introduction (mentioning)
confidence: 99%
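At its core, the APC objective summarized here reduces to predicting a frame a fixed number of steps into the future from the frames before it. The sketch below builds such input/target pairs; the time shift of 3 frames and the L1 loss are illustrative choices, not the exact settings of [10,11].

import numpy as np

def apc_pairs(features, shift=3):
    """Build (input, target) pairs for Autoregressive Predictive Coding.

    The model reads features[t] and is trained to predict the frame
    `shift` steps ahead, i.e. targets[t] == features[t + shift].
    """
    return features[:-shift], features[shift:]

def l1_loss(pred, target):
    # Mean absolute error between predicted and true future frames.
    return float(np.abs(pred - target).mean())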