2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461384

Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors

Cited by 110 publications (71 citation statements)
References 9 publications
“…Indeed, the challenge in developing the non-parallel spectral conversion model has attracted many works within the recent years, such as: with the use of clustered spectral matching algorithms [12,13]; with adaptation/alignment of speaker model parameters [14,15]; with restricted Boltzmann machine [16]; with generative adversarial networks (GAN)-based methods [17,18]; and with variational autoencoder (VAE)-based frameworks [19,20,21,22]. In this work, we focus on the use of VAE-based system, due to its potential in employing latent space to represent common hidden aspects of speech signal, between different speakers, e.g., phonetical attributes.…”
Section: Introduction
confidence: 99%
“…The proposed training algorithm does not use pre-stored speakers' voice data, unlike one-to-many non-parallel VC [9] and Kinnunen et al's work [12]. One of our future directions is to use the pre-stored speakers' voice data to improve speaker individuality of the converted voice (e.g., using transfer learning [17]).…”
Section: Discussion
confidence: 99%
“…In one-to-many VC, i.e., VC from a source speaker to any arbitrary speakers, G(·) is often trained using multi-speaker corpora in advance, and is adapted using the specific targeted speaker's voice data. This paper utilizes the d-vector-based model adaptation method [9]. The method employs another DNN that estimates a d-vector (i.e., DNN-based continuous speaker representation [2]) of the targeted speaker, and feeds it to the VC model to convert the source speaker's features into the targeted speaker's ones.…”
Section: One-to-many Non-parallel VC
confidence: 99%
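The d-vector adaptation described in the statement above, where a continuous speaker embedding is fed to the conversion model so that source-speaker features are decoded as target-speaker features, can be sketched minimally as follows. This is a hedged illustration, not the cited paper's implementation: the dimensions, the toy linear decoder `decode`, and the helper `convert` are all hypothetical, and a real system would use trained DNNs for both the d-vector extractor and the conversion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 64-dim content code per
# frame, 32-dim d-vector speaker embedding, 80-dim spectral feature frame.
LATENT_DIM, DVEC_DIM, FEAT_DIM = 64, 32, 80

# Toy linear "decoder" weights mapping [content ; d-vector] -> spectral frame.
W = rng.standard_normal((FEAT_DIM, LATENT_DIM + DVEC_DIM)) * 0.01

def decode(z, d_vector):
    """Decode one frame, conditioning on the speaker's d-vector by
    concatenating it with the frame's content code."""
    return W @ np.concatenate([z, d_vector])

def convert(z_frames, target_dvec):
    """One-to-many conversion: re-decode source-derived content codes
    with the *target* speaker's d-vector instead of the source's."""
    return np.stack([decode(z, target_dvec) for z in z_frames])

z_frames = rng.standard_normal((10, LATENT_DIM))  # content codes, 10 frames
target_dvec = rng.standard_normal(DVEC_DIM)       # target speaker embedding
converted = convert(z_frames, target_dvec)
print(converted.shape)  # (10, 80)
```

Conditioning by concatenation is only one design choice; the key point in the cited statement is that the speaker identity enters the model as a continuous embedding, so an unseen target speaker needs only enough speech to estimate a d-vector, not parallel training data.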
“…Recently, neural network approaches are widely studied. Deep generative models like variational auto-encoder (VAE) [18,19] and generative adversarial networks (GANs) [20,21] without requiring parallel data are found effective for VC, though in the same language.…”
Section: Introduction
confidence: 99%