2021
DOI: 10.48550/arxiv.2110.04590
Preprint
An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

Abstract: Self-supervised pretraining on speech data has made substantial progress. High-fidelity representations of the speech signal are learned from large amounts of untranscribed data and show promising performance. Several recent works focus on evaluating the quality of self-supervised pretrained representations across various tasks without domain restriction, e.g. SUPERB. However, such evaluations do not provide a comprehensive comparison across many ASR benchmark corpora. In this paper, we focus on the general…

Cited by 4 publications (4 citation statements)
References 23 publications
“…The distributional shifts here may stem from both the acoustics (microphone, room reverberation) as well as lexical effects related to topic and style, as well as differences in speaker characteristics such as accent. Similar problems were also investigated using HuBERT and wav2vec 2.0 models in [280]. In [281] domain effects were studied in greater detail using datasets from six different domains.…”
Section: Robustness and Transferability
confidence: 99%
“…The distributional shifts here may stem from both the acoustics (microphone, room reverberation) as well as lexical effects related to topic and style, as well as differences in speaker characteristics such as accent. Similar problems were also investigated using HuBERT and wav2vec 2.0 models in [261]. In [262] domain effects were studied in greater detail using datasets from six different domains.…”
Section: Robustness and Transferability
confidence: 99%
“…Several recent works focus on utilizing pre-trained speech representations for different speech-related tasks, including speech recognition [27,28] and speech enhancement [29]. Compared with traditional handcrafted features, such as pause, duration, fundamental frequency (F0), and Mel-frequency cepstral coefficients (MFCCs), a variety of pre-trained representations obtain significant improvements on the task of speech recognition [27]. Specifically, the recent Wav2Vec2.0 [28] is a transformer-based speech framework trained by predicting speech units for masked parts of the audio.…”
Section: Representation Learning on Speech
confidence: 99%
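The masked-prediction objective described in the quoted passage can be sketched as follows. This is an illustrative NumPy sketch only, not the actual wav2vec 2.0 implementation: all function names, hyperparameter values, and the simplified contrastive loss here are assumptions chosen to show the idea (random span starts are sampled, the frames in each span are hidden, and the model must pick the true target for each masked frame out of a set of distractors).

```python
import numpy as np

def sample_mask(num_frames, mask_prob=0.065, mask_span=10, rng=None):
    """Return a boolean mask over frames (True = masked).

    Each frame is independently chosen as a span start with probability
    `mask_prob`; the `mask_span` frames from that start onward are masked.
    Values here are illustrative, not the paper's settings.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    starts = rng.random(num_frames) < mask_prob
    mask = np.zeros(num_frames, dtype=bool)
    for s in np.flatnonzero(starts):
        mask[s:s + mask_span] = True
    return mask

def contrastive_loss(pred, targets, mask, num_negatives=5,
                     temperature=0.1, rng=None):
    """Simplified InfoNCE-style loss over masked frames only.

    pred:    (T, D) model outputs at each frame
    targets: (T, D) target vectors (standing in for quantized units)
    For each masked frame t, the positive is targets[t]; negatives are
    targets drawn from other masked positions.
    """
    rng = rng if rng is not None else np.random.default_rng(1)
    masked = np.flatnonzero(mask)
    losses = []
    for t in masked:
        others = masked[masked != t]
        negs = rng.choice(others, size=min(num_negatives, len(others)),
                          replace=False)
        cands = np.vstack([targets[t], targets[negs]])  # positive first
        sims = cands @ pred[t] / (
            np.linalg.norm(cands, axis=1) * np.linalg.norm(pred[t]) + 1e-8)
        logits = sims / temperature
        # cross-entropy with the positive at index 0
        losses.append(-logits[0] + np.log(np.exp(logits).sum()))
    return float(np.mean(losses))

# Toy demo: predictions are noisy copies of the targets.
T, D = 200, 16
rng = np.random.default_rng(42)
frames = rng.standard_normal((T, D))
mask = sample_mask(T, rng=rng)
loss = contrastive_loss(frames + 0.1 * rng.standard_normal((T, D)),
                        frames, mask, rng=rng)
```

In the real framework the targets are discrete units from a learned quantizer and the predictor is a deep transformer; this sketch replaces both with plain vectors to keep the span-masking and contrastive-selection mechanics visible.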