Modeling and Transforming Speech Using Variational Autoencoders

Building a voice conversion (VC) system from non-parallel speech corpora is challenging but highly valuable in real application scenarios. In most situations, the source and the target speakers do not repeat the same texts or they may even speak different languages. In this case, one possible, although indirect, solution is to build a generative model for speech. Generative models focus on explaining the observations with latent variables instead of learning a pairwise transformation function, thereby bypassing the requirement of speech frame alignment. In this paper, we propose a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model. Experimental results corroborate the capability of our framework for building a VC system from unaligned data, and demonstrate improved conversion quality.

Index Terms: non-parallel voice conversion, Wasserstein generative adversarial network, GAN, variational autoencoder, VAE

“…Recent works have proven the viability of speech modeling with VAEs [3,4]. A C-VAE that realizes the PGM in Fig.…”

Section: Modeling Speech With a C-VAE (mentioning)

confidence: 99%

Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks

Hsu, Hwang, Wu et al. 2017

Interspeech 2017

“…In the speech and audio domain, VAEs have mainly been used for speech generation and transformation [8]. They have also been used to learn phonetic content or speaker identity in speech segments without supervisory data [7,8]. Moreover, a framework based on VAE was used in [15] to learn both frame-level and utterance-level robust representations.…”

Section: Related Work (mentioning)

confidence: 99%

Variational Autoencoders for Learning Latent Representations of Speech Emotion: A Preliminary Study

et al. 2018

Learning the latent representation of data in an unsupervised fashion is a very interesting process that provides relevant features for enhancing the performance of a classifier. For speech emotion recognition tasks, generating effective features is crucial. Currently, handcrafted features are mostly used for speech emotion recognition; however, features learned automatically using deep learning have shown strong success in many problems, especially in image processing. In particular, deep generative models such as Variational Autoencoders (VAEs) have gained enormous success for generating features for natural images. Inspired by this, we propose VAEs for deriving the latent representation of speech signals and use this representation to classify emotions. To the best of our knowledge, we are the first to propose VAEs for speech emotion classification. Evaluations on the IEMOCAP dataset demonstrate that features learned by VAEs can produce state-of-the-art results for speech emotion classification.

“…A VAE-based framework was used in [18] to extract both frame-level and utterance-level features that were used in combination with other features for robust speech recognition. A fully-connected VAE was used in [14] to learn a frame-level latent representation, and evaluated using a Gaussian diffusion process to generate and concatenate multiple samples that varied smoothly in time.…”

Section: Related Work (mentioning)

confidence: 99%

Learning Latent Representations for Speech Generation and Transformation

2017

An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data.

Modeling and Transforming Speech Using Variational Autoencoders

Cited by 57 publications

References 12 publications
