2020
DOI: 10.48550/arxiv.2005.12963
Preprint

Contrastive Predictive Coding Supported Factorized Variational Autoencoder for Unsupervised Learning of Disentangled Speech Representations

Abstract: In this work we tackle disentanglement of speaker- and content-related variations in speech signals. We propose a fully convolutional variational autoencoder employing two encoders: a content encoder and a speaker encoder. To foster disentanglement we propose adversarial contrastive predictive coding. This new disentanglement method needs neither parallel data nor any supervision, not even speaker labels. With successful disentanglement the model is able to perform voice conversion by recombining content an…
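The two-encoder architecture the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, kernel widths, and the choice of log-mel input features are all assumptions, and the variational and adversarial CPC components are omitted.

```python
import torch
import torch.nn as nn

class FactorizedAutoencoder(nn.Module):
    """Hypothetical sketch of a factorized (two-encoder) autoencoder:
    a content encoder keeps the time axis, a speaker encoder pools it
    away, and a decoder reconstructs from their combination."""

    def __init__(self, n_mels=80, content_dim=64, speaker_dim=32):
        super().__init__()
        # Content encoder: frame-wise embeddings, time axis preserved.
        self.content_enc = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, padding=2), nn.ReLU(),
            nn.Conv1d(128, content_dim, 5, padding=2),
        )
        # Speaker encoder: one utterance-level vector via time pooling.
        self.speaker_enc = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, padding=2), nn.ReLU(),
            nn.Conv1d(128, speaker_dim, 5, padding=2),
            nn.AdaptiveAvgPool1d(1),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(content_dim + speaker_dim, 128, 5, padding=2), nn.ReLU(),
            nn.Conv1d(128, n_mels, 5, padding=2),
        )

    def forward(self, x, speaker_ref=None):
        # Voice conversion: take the speaker embedding from another
        # utterance (speaker_ref) while keeping the content of x.
        c = self.content_enc(x)                          # (B, content_dim, T)
        s = self.speaker_enc(speaker_ref if speaker_ref is not None else x)
        s = s.expand(-1, -1, c.size(-1))                 # broadcast over time
        return self.decoder(torch.cat([c, s], dim=1))    # (B, n_mels, T)
```

Recombining the content embeddings of one utterance with the speaker embedding of another is what enables the voice conversion mentioned in the abstract.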

Cited by 3 publications (5 citation statements)
References 17 publications
“…The proposed system is shown in Figure 1. It consists of an adversarial CPC based VC system [22] for speaker normalization and a subsequent HMMVAE [7] to perform the AUD.…”
Section: System Setup
confidence: 99%
“…For the voice conversion we here employ a Factorized Variational Autoencoder (FVAE) along with adversarial CPC as proposed in [22], which has been shown to yield a well-balanced trade-off between linguistic content preservation and speaker invariance. The FVAE employs two convolutional encoders, namely, a content encoder outputting a series of content embeddings C = (c1, .…”
Section: Adversarial CPC Based Voice Conversion
confidence: 99%
“…Recent work from [10] aims to disentangle content from speaker for voice conversion tasks. While their technique does not necessarily rely on speaker labels, it does use vocal tract length perturbation (VTLP) to facilitate regularization during adversarial disentanglement.…”
Section: Related Work
confidence: 99%
“…In more recent work, VAEs have been combined with contrastive predictive coding (CPC) [12] to tackle voice conversion. In [13], CPC is used as an additional regularization objective to enhance the content encoder. VQVAE-CPC [14] used a vector-quantized VAE (VQ-VAE) [15], whose quantized latent representations could discard undesired attributes from the source speech more effectively.…”
Section: Introduction
confidence: 99%
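The CPC regularization the citing works refer to is typically realized as an InfoNCE-style contrastive loss. The sketch below is an illustrative assumption of that pattern, not any cited paper's exact setup: a predictor maps a context summary to a predicted future embedding, and the other items in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F

def cpc_loss(context, future, predictor):
    """InfoNCE-style CPC objective (hypothetical sketch).

    context:   (B, D) context summaries at some time step
    future:    (B, D) true embeddings at a later time step
    predictor: module mapping context -> predicted future embedding
    """
    pred = predictor(context)          # (B, D) predicted future embeddings
    logits = pred @ future.t()         # (B, B) similarity scores
    # The matching pair (diagonal) is the positive; all other rows of the
    # batch act as negatives for the cross-entropy classification.
    targets = torch.arange(context.size(0))
    return F.cross_entropy(logits, targets)
```

Minimizing this loss encourages the content embeddings to be predictive of their own future frames, which is what makes CPC useful as a regularizer for the content encoder described in [13] and [22].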