Self-supervised pre-training for speech
In speech, wav2vec (Schneider et al., 2019) leverages contrastive learning to produce contextual representations for audio input; vq-wav2vec (Baevski et al., 2020a) and wav2vec 2.0 (Baevski et al., 2020b) further propose to discretize the continuous audio signal in order to enable more efficient masked language model (MLM) training with Transformers (Vaswani et al., 2017). Pre-trained speech models have been applied to ASR (Baevski et al., 2020b), phoneme recognition (Song et al., 2020; Liu et al., 2020a), speech translation (Nguyen et al., 2020; Chung et al., 2019c), and speech synthesis (Chung et al., 2019b), to name a few.
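At the core of wav2vec-style pre-training is a contrastive (InfoNCE-style) objective: a context vector at each time step must identify the true latent target among sampled distractors. The sketch below is a minimal, illustrative version of that idea in PyTorch; the function name, tensor shapes, and negative-sampling scheme are simplifications for exposition, not the actual wav2vec/fairseq implementation.

```python
# Minimal sketch of an InfoNCE-style contrastive loss, as used (in far
# more elaborate form) by the wav2vec family. Names and shapes here are
# illustrative assumptions, not the original implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, num_negatives=10, temperature=0.1):
    """context, targets: (T, D) per-time-step vectors.

    For each time step t, the context vector must score the true target
    targets[t] above `num_negatives` distractors drawn from other steps.
    """
    T, _ = context.shape
    # Sample negative indices uniformly over time steps (this simplified
    # sketch does not exclude collisions with the positive).
    neg_idx = torch.randint(0, T, (T, num_negatives))
    negatives = targets[neg_idx]                                      # (T, K, D)
    candidates = torch.cat([targets.unsqueeze(1), negatives], dim=1)  # (T, K+1, D)
    # Cosine similarity between each context vector and its candidates.
    logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    # The true target always sits at candidate index 0.
    labels = torch.zeros(T, dtype=torch.long)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    c = torch.randn(50, 256)  # stand-in for context network outputs
    q = torch.randn(50, 256)  # stand-in for quantized latent targets
    print(contrastive_loss(c, q).item())
```

In vq-wav2vec, discretizing the targets additionally lets a standard BERT-style MLM be trained on the resulting token sequence, which is where the efficiency gain noted above comes from.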