Interspeech 2019
DOI: 10.21437/interspeech.2019-1338
Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation

Abstract: This study tackles unsupervised subword modeling in the zero-resource scenario: learning frame-level speech representations that are phonetically discriminative and speaker-invariant, using only untranscribed speech of the target languages. Frame label acquisition is an essential step in solving this problem. High-quality frame labels should be highly consistent with gold-standard transcriptions and robust to speaker variation. We propose to improve frame label acquisition in our previously adopted deep neural network-bo…

Cited by 7 publications (17 citation statements). References 18 publications.
“…A typical self-supervised representation learning model is the vector-quantized variational autoencoder (VQ-VAE) [14], which achieved fairly good performance in ZeroSpeech 2017 [34] and 2019 [9], and has become more widely adopted [35]–[37] in the latest ZeroSpeech 2020 challenge [38]. Other self-supervised learning algorithms such as the factorized hierarchical VAE (FHVAE) [39], contrastive predictive coding (CPC) [22], and APC [26] were also extensively investigated in unsupervised subword modeling [27], [35], [40], [41].…”
Section: A. Unsupervised Learning Techniques
Confidence: 99%
“…Heck et al. extended this approach by applying unsupervised speaker adaptation, which performed the best in ZeroSpeech 2017 [3]. In a recent study [13], a two-stage bottleneck feature (BNF) learning framework was proposed. The first stage, i.e., the front-end, used the factorized hierarchical variational autoencoder (FHVAE) [14] to learn speaker-invariant features.…”
Section: Introduction
Confidence: 99%
“…During MFCC reconstruction, one male speaker per language is randomly selected as the representative speaker for s-vector unification. Our recent research findings [11] showed that male speakers are more suitable than female speakers for generating speaker-invariant features. The IDs of the selected speakers are 'S015' and 'S002' in English and Surprise, respectively.…”
Section: System Setup
Confidence: 96%
“…By concatenating training utterances of the same speaker into a single sequence for FHVAE training, the learned µ2 is expected to be discriminative to speaker identity. This work considers applying s-vector unification [11] to generate a reconstructed feature representation that keeps the linguistic content unchanged and is more speaker-invariant than the original representation. Specifically, a representative speaker with his/her s-vector (denoted as µ2*) is chosen from the dataset.…”
Section: Speaker-Invariant Feature Learning by FHVAE
Confidence: 99%
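The s-vector unification idea quoted above can be sketched with a toy linear decoder standing in for the FHVAE decoder: each segment's linguistic latent z1 is kept, while the utterance's speaker s-vector µ2 is replaced by the representative speaker's µ2* before decoding. All names, shapes, and the decoder itself are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-in for the FHVAE decoder:
# features = z1 @ W1 + z2 @ W2
W1 = rng.normal(size=(32, 13))   # maps linguistic latent z1 to 13-dim MFCCs
W2 = rng.normal(size=(16, 13))   # maps speaker latent z2 to 13-dim MFCCs

def decode(z1, z2):
    """Reconstruct frame-level features from segment latents z1 and a
    (broadcast) speaker latent z2."""
    return z1 @ W1 + z2 @ W2

# One utterance: 20 segments of linguistic latents from some speaker A.
z1_utt = rng.normal(size=(20, 32))
mu2_spk_a = rng.normal(size=(16,))   # speaker A's original s-vector
mu2_star = rng.normal(size=(16,))    # representative speaker's s-vector

# s-vector unification: decode with mu2_star instead of mu2_spk_a.
orig = decode(z1_utt, mu2_spk_a)
unified = decode(z1_utt, mu2_star)

# The linguistic term (z1 @ W1) is identical in both reconstructions;
# only the speaker-dependent offset changes, so content is preserved
# while speaker identity is unified across utterances.
assert np.allclose(orig - unified, (mu2_spk_a - mu2_star) @ W2)
```

In this linear toy the speaker swap is exactly an additive offset; in the actual nonlinear FHVAE decoder the same substitution changes speaker characteristics without that closed form, but the content-preserving intent is the same.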