2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383575
Audio ALBERT: A Lite BERT for Self-Supervised Learning of Audio Representation

Citations: Cited by 126 publications (54 citation statements)
References: 15 publications
“…The typical usage of this dataset is for ASR (Huang et al., 2020; Zhang et al., 2020). It could also be used for self-supervised training (Chi et al., 2020; Liu et al., 2020) and transferred to downstream tasks like phoneme classification, speaker recognition, and sentiment classification.…”
Section: Speech Dataset
confidence: 99%
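A minimal sketch (my own illustration, not code from the cited papers) of the transfer recipe this excerpt describes: the self-supervised encoder is kept frozen and only a small probe is trained on the downstream task, such as phoneme classification or speaker recognition. FEATURE_DIM, NUM_CLASSES, and train_step are hypothetical names.

import torch
import torch.nn as nn

FEATURE_DIM = 768   # dimensionality of the frozen encoder's frame features (assumed)
NUM_CLASSES = 40    # hypothetical downstream label inventory, e.g. phonemes

# Only the probe is trained; the pretrained encoder's weights stay untouched.
probe = nn.Linear(FEATURE_DIM, NUM_CLASSES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_step(frozen_feats: torch.Tensor, labels: torch.Tensor) -> float:
    """One downstream step. frozen_feats: (batch, frames, FEATURE_DIM) computed
    beforehand with torch.no_grad(); labels: (batch, frames) frame-level targets."""
    logits = probe(frozen_feats)                       # (batch, frames, NUM_CLASSES)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, NUM_CLASSES), labels.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()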
“…The motivation for pretraining data with MulT is to capture and model temporal dependencies, so we also want the base features to be temporally independent. Thus, even though features extracted from pretrained Speech Transformers such as [3,15,16] are powerful, they are not suitable as base features for MulT.…”
Section: Feature Selection
confidence: 99%
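To make "temporally independent base features" concrete, here is a small sketch (my assumption of what such features look like, not the cited authors' setup): frame-level MFCCs, where each output frame depends only on a short local analysis window, in contrast to the contextualized outputs of a pretrained Speech Transformer.

import torch
import torchaudio

SAMPLE_RATE = 16_000
# 25 ms windows with a 10 ms hop; every MFCC frame sees only its own window.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=40,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 64},
)

waveform = torch.randn(1, SAMPLE_RATE)   # one second of dummy audio
frames = mfcc(waveform)                  # shape: (1, n_mfcc, num_frames)
print(frames.shape)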
“…Such representations, computed by neural models trained on huge amounts of unlabeled data, have shown their effectiveness on some tasks under certain conditions, for instance in ASR [45], [46], or speech translation [47]. Recently, Wav2Vec [48], Mockingjay [46], and Audio ALBERT [49] were introduced in ASR and speaker identification as some of the first pre-trained approaches to extract context-dependent features from raw signals, but they have not been used for SER yet. Very recently, a BERT-like model for French has been developed [50].…”
Section: Pre-trained Features for NLP
confidence: 99%
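As a rough illustration of extracting context-dependent features from raw signals (an assumption on my part; torchaudio's wav2vec 2.0 bundle stands in here for Wav2Vec, Mockingjay, or Audio ALBERT), each tensor returned below is the output of one transformer layer and could feed a downstream classifier such as an SER model.

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()        # pretrained encoder, used as-is

waveform = torch.randn(1, 16_000)        # one second of dummy 16 kHz audio
with torch.inference_mode():
    layer_outputs, _ = model.extract_features(waveform)

for i, feats in enumerate(layer_outputs):
    print(f"layer {i}: {tuple(feats.shape)}")    # (batch, num_frames, 768)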