2021
DOI: 10.48550/arxiv.2106.07447
Preprint

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Abstract: Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
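The recipe in the abstract (offline cluster IDs as frame-level targets, a BERT-like prediction loss applied over masked frames) can be sketched in a few lines of PyTorch. This is an illustrative toy, not the released HuBERT: `TinyMaskedPredictor`, the feature dimensions, and the masking scheme are all made-up stand-ins.

```python
# Toy sketch of a HuBERT-style masked-prediction objective (illustrative only).
# Targets are cluster IDs from an offline clustering step (e.g. k-means on MFCCs);
# the tiny transformer below is a stand-in for the real HuBERT encoder.
import torch
import torch.nn as nn

class TinyMaskedPredictor(nn.Module):
    def __init__(self, feat_dim=39, hidden=256, n_clusters=100):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, hidden)
        self.mask_emb = nn.Parameter(torch.randn(hidden))  # learned [MASK] embedding
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, n_clusters)

    def forward(self, feats, mask):
        # feats: (B, T, feat_dim); mask: (B, T) bool, True where frames are masked
        x = self.proj_in(feats)
        # Replace masked frames with the learned mask embedding.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.encoder(x))  # (B, T, n_clusters) logits

B, T = 2, 50
feats = torch.randn(B, T, 39)                # stand-in acoustic features
targets = torch.randint(0, 100, (B, T))      # offline cluster assignments
mask = torch.rand(B, T) < 0.5                # toy masking (real HuBERT masks spans)
model = TinyMaskedPredictor()
logits = model(feats, mask)
# Key HuBERT ingredient: compute the prediction loss over masked frames only.
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
print(loss.item())
```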

Cited by 61 publications (142 citation statements) · References 42 publications
“…The loss weight on the unmasked region does not have a large impact on fine-tuning performance. This differs from the findings for audio HuBERT (Hsu et al., 2021a), where masked prediction leads to much better performance. Given image frames as input, predicting cluster assignments, which are mostly determined by the accompanying audio stream, helps encode phonetic information into the visual representation.…”
Section: Modality Dropout (contrasting)
confidence: 99%
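The masked/unmasked trade-off this excerpt discusses amounts to a weighted sum of two cross-entropy terms. A minimal sketch, assuming frame-level cluster-ID targets; the function name and the `alpha` parameterization (weight on the unmasked term) are illustrative, with `alpha = 0` recovering the masked-only setting the HuBERT paper favors:

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(logits, targets, mask, alpha=0.0):
    """Weighted HuBERT-style loss: full weight on masked frames,
    weight `alpha` on unmasked frames. Assumes both the masked and
    unmasked subsets are non-empty."""
    loss_masked = F.cross_entropy(logits[mask], targets[mask])
    loss_unmasked = F.cross_entropy(logits[~mask], targets[~mask])
    return loss_masked + alpha * loss_unmasked

logits = torch.randn(2, 50, 100)           # (batch, frames, clusters)
targets = torch.randint(0, 100, (2, 50))   # offline cluster IDs
mask = torch.rand(2, 50) < 0.5             # True where frames are masked
print(masked_prediction_loss(logits, targets, mask, alpha=0.0))
```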
“…Our research builds on audio HuBERT (Hsu et al., 2021a), a self-supervised learning framework for speech and audio. It alternates between two steps: feature clustering and masked prediction.…”
Section: Preliminary: Audio HuBERT (mentioning)
confidence: 99%
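The two-step alternation this excerpt names — cluster features offline, then train with masked prediction on the resulting IDs — can be written as a short refinement loop. A sketch with both steps stubbed out: `extract_features` and `train_masked_prediction` are hypothetical placeholders, and only the k-means call is real scikit-learn API.

```python
# Sketch of the clustering / masked-prediction alternation described above.
# Real HuBERT clusters MFCCs in the first iteration and intermediate
# transformer features in later iterations; here both steps are toy stubs.
import numpy as np
from sklearn.cluster import KMeans

def extract_features(waveforms, model=None):
    # Iteration 0: acoustic features (e.g. MFCCs); later: model activations.
    rng = np.random.default_rng(0)
    return [rng.normal(size=(100, 39)) for _ in waveforms]  # toy stand-in

def train_masked_prediction(waveforms, cluster_ids):
    ...  # BERT-like training on masked frames (see the earlier sketch)
    return "trained-model"

waveforms = [None] * 8  # placeholder corpus
model = None
for iteration in range(2):  # HuBERT uses a small number of refinement iterations
    feats = extract_features(waveforms, model)
    km = KMeans(n_clusters=100, n_init=10).fit(np.concatenate(feats))
    cluster_ids = [km.predict(f) for f in feats]  # aligned frame-level targets
    model = train_masked_prediction(waveforms, cluster_ids)
```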
“…For the regular and small models, we trained a 12-layer encoder with 2048 feed-forward units, hidden dimension d = 256, and M = 4 attention heads. The large model was trained with d = 512 and M = 8 and used HuBERT [25] features pretrained on Libri-light [26].…”
Section: Methods (mentioning)
confidence: 99%
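Read literally, the quoted configuration maps onto a standard PyTorch transformer encoder as below. This is a sketch of the stated hyperparameters only, not the cited paper's actual model code.

```python
# The regular/small configuration quoted above as a PyTorch encoder:
# 12 layers, 2048 feed-forward units, d_model = 256, 4 attention heads.
import torch
import torch.nn as nn

d, M = 256, 4
layer = nn.TransformerEncoderLayer(
    d_model=d, nhead=M, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

x = torch.randn(1, 200, d)  # (batch, frames, d); inputs projected to d beforehand
print(encoder(x).shape)     # torch.Size([1, 200, 256])
```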
“…1. Specifically, by exploiting the strong reconstruction ability of the vector-quantized variational autoencoder (VQ-VAE) [11] or the HuBERT [12] + HiFi-GAN [13] model, we transform the target speech of the speech separation model into a sequence of discrete symbols. From the estimated discrete symbol sequence, each target utterance can be re-synthesized, optionally with a transferred style.…”
Section: Introduction (mentioning)
confidence: 99%
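The speech-to-discrete-symbols step this excerpt describes is commonly done by quantizing HuBERT features with k-means. A sketch using torchaudio's HUBERT_BASE pipeline but a random, hypothetical codebook in place of a trained one; the HiFi-GAN re-synthesis step is only indicated in a comment.

```python
# Sketch of the "speech -> discrete symbols" step: HuBERT features quantized
# against a k-means codebook yield a unit sequence that a unit vocoder such
# as HiFi-GAN can map back to a waveform. The codebook below is random and
# purely illustrative; get_model() downloads pretrained weights.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model()

waveform = torch.randn(1, bundle.sample_rate)  # stand-in for 1 s of real speech
with torch.no_grad():
    feats, _ = hubert.extract_features(waveform)
frames = feats[6][0]  # an intermediate layer, a common choice for unit discovery

# Hypothetical pre-trained k-means centroids: (n_units, feat_dim)
codebook = torch.randn(100, frames.shape[-1])
units = torch.cdist(frames, codebook).argmin(dim=-1)  # discrete symbol sequence
print(units[:20])
# A HiFi-GAN unit vocoder would re-synthesize speech from `units`.
```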