10th ISCA Workshop on Speech Synthesis (SSW 10) 2019
DOI: 10.21437/ssw.2019-15
Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion

Abstract: This paper presents an approach to voice conversion which requires neither parallel data nor speaker or phone labels for training. It can convert between speakers that are not in the training set by employing the previously proposed concept of a factorized hierarchical variational autoencoder. Here, linguistic and speaker-induced variations are separated based on the notion that content-induced variations change at a much shorter time scale, i.e., at the segment level, than speaker-induced variations, which …
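The abstract's time-scale argument can be illustrated with a toy numerical sketch (not the paper's model): a speaker-like component that is constant across an utterance separates from fast-varying content precisely because it changes slowly, so averaging over segments cancels the content term. All sizes and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of the time-scale assumption: a per-utterance
# (speaker-like) offset is constant over the whole sequence, while
# the content component draws a new value for every segment.
n_segments, dim = 50, 8
speaker_offset = rng.normal(size=dim)            # slow: fixed per utterance
content = rng.normal(size=(n_segments, dim))     # fast: varies per segment
features = content + speaker_offset

# Averaging over many segments cancels the fast content variation,
# leaving an estimate of the slow speaker component; the residual
# per segment is then a content-like representation.
speaker_est = features.mean(axis=0)
content_est = features - speaker_est

print(np.abs(speaker_est - speaker_offset).max())  # small residual
```

This is of course only the linear-additive caricature of the idea; the factorized hierarchical VAE learns the two components jointly with separate segment-level and sequence-level latent variables.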

Cited by 4 publications (7 citation statements) · References 12 publications
“…Among the different models rising in the field of deep generative modeling, the variational auto-encoder [17,18] has become probably the most popular in speech processing [19,20]. The variational auto-encoder is a deep latent variable model, where the probability density of the observation depends on the output of a highly non-linear transformation of the latent variable, and the transformation is modeled by a neural network (known as a decoder).…”
Section: Variational Autoencoder as Spectral Model
confidence: 99%
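The quoted description of the variational auto-encoder can be sketched minimally: a latent vector passed through a non-linear network (the decoder) yields the parameters of the observation density p(x | z). The architecture below is a hypothetical stand-in with assumed dimensions, not the model of the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal VAE-decoder sketch: the latent z is mapped through a
# non-linear transformation to the mean of a diagonal Gaussian
# over an observed spectral frame (dimensions are assumptions).
latent_dim, hidden_dim, obs_dim = 16, 32, 80
W1 = rng.normal(scale=0.1, size=(hidden_dim, latent_dim))
W2 = rng.normal(scale=0.1, size=(obs_dim, hidden_dim))

def decode(z):
    h = np.tanh(W1 @ z)   # highly non-linear transformation
    return W2 @ h         # mean parameter of p(x | z)

z = rng.normal(size=latent_dim)   # one sample of the latent variable
x_mean = decode(z)
print(x_mean.shape)  # (80,)
```

In a full VAE this decoder is trained jointly with an encoder q(z | x) by maximizing the evidence lower bound; only the generative half is shown here.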
“…In our work, we use VAE to model single-speaker speech in the log-magnitude STFT domain. We use a (slightly modified) model from [20], which was inspired by [19]. In this model, part of the latent variables is trained to model speaker variability.…”
Section: Variational Autoencoder as Spectral Model
confidence: 99%
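The log-magnitude STFT features mentioned in the quote can be computed as follows; the frame length, hop size, window, and log floor `eps` are assumptions for illustration, not values from the cited work.

```python
import numpy as np

# Hypothetical log-magnitude STFT front end: window the signal into
# overlapping frames, take the one-sided FFT of each frame, and
# compress the magnitude with a log (floored by eps for stability).
def log_magnitude_stft(signal, frame_len=512, hop=128, eps=1e-8):
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)      # one-sided spectrum
    return np.log(np.abs(spec) + eps)       # log-magnitude features

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)           # 1 s of a 440 Hz tone
S = log_magnitude_stft(x)
print(S.shape)  # (122, 257): (n_frames, frame_len // 2 + 1)
```

A VAE spectral model of the kind described above would then treat each row of `S` (one log-magnitude frame) as an observation x.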