ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683623
Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis

Abstract: In this paper, we introduce the Variational Autoencoder (VAE) to an end-to-end speech synthesis model, to learn the latent representation of speaking styles in an unsupervised manner. The style representation learned through VAE shows good properties such as disentangling, scaling, and combination, which makes it easy for style control. Style transfer can be achieved in this framework by first inferring style representation through the recognition network of VAE, then feeding it into TTS network to guide the s…
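The sketch below illustrates the idea described in the abstract: a recognition network infers a latent style vector from a reference mel-spectrogram, and the sampled vector conditions a Tacotron-like decoder. It is a minimal, hypothetical rendering only; module names, dimensions, and the conditioning scheme (concatenation along the text axis) are assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): a VAE-style reference encoder
# that infers a latent style vector z from a reference mel-spectrogram and
# conditions a Tacotron-like decoder on it. All dimensions are placeholders.
import torch
import torch.nn as nn


class StyleVAE(nn.Module):
    """Recognition network q(z | reference audio) -> (mu, logvar)."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, z_dim: int = 16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)

    def forward(self, ref_mel: torch.Tensor):
        # ref_mel: (batch, frames, n_mels)
        _, h = self.rnn(ref_mel)                 # h: (1, batch, hidden)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar


def kl_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # KL(q(z|x) || N(0, I)), averaged over the batch
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()


def condition_decoder(text_hidden: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Broadcast the style vector over the text time axis and concatenate,
    # so a (hypothetical) attention/decoder stack sees style at every step.
    z_tiled = z.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
    return torch.cat([text_hidden, z_tiled], dim=-1)


if __name__ == "__main__":
    vae = StyleVAE()
    ref = torch.randn(2, 120, 80)   # stand-in reference mel-spectrograms
    txt = torch.randn(2, 50, 512)   # stand-in text-encoder outputs
    z, mu, logvar = vae(ref)
    cond = condition_decoder(txt, z)
    print(cond.shape, kl_loss(mu, logvar).item())
```

In training, the KL term would be added to the usual TTS reconstruction loss; at inference, style transfer amounts to running the recognition network on a reference utterance and passing the resulting vector to the synthesis network.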


Cited by 212 publications (159 citation statements). References 10 publications.
“…In the case of speech processing, an ideal disentangled representation would be able to separate fine-grained factors such as speaker identity, noise, recording channels, and prosody [22], as well as the linguistic content. Thus, disentanglement will allow learning of salient and robust representations from the speech that are essential for applications including speech recognition [64], prosody transfer [77,87], speaker verification [66], speech synthesis [31,77], and voice conversion [32], among other applications.…”
Section: Learning Disentangled Representation (mentioning)
confidence: 99%
“…Some researchers make some progress to use a reference encoder to capture prosody information from audios by several feature learning techniques [4][5][6][7][8]. The above models can transfer the prosody from reference audio to the audios to be synthesised.…”
Section: Graph Auxiliary Encoder (mentioning)
confidence: 99%
“…For example, some researchers implemented open clones of Tacotron [66][67][68] to reproduce the speech of satisfactory quality as clear as the original work [69]. The authors in [70] introduced deep generative models, such as Variational Auto-encoder (VAE) [71], to Tacotron to explicitly model the latent representation of a speaker state in a continuous space, and additionally to control the speaking style in speech synthesis [70].…”
Section: Speech Synthesis Based On Tacotron (mentioning)
confidence: 99%
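The citing works above repeatedly highlight the controllability of the learned latent space (disentangling, scaling, combination). Purely as an illustration, and reusing the hypothetical StyleVAE sketched after the abstract, style control at inference can reduce to simple arithmetic on the inferred vector, for example scaling one latent dimension or interpolating between two reference utterances. The dimension index and mixing weight below are arbitrary examples, not values from the paper.

```python
# Illustrative only: manipulating the latent style vector at inference time.
# At inference the posterior mean mu is typically used instead of a sample.
import torch


def scale_dimension(z: torch.Tensor, dim: int, factor: float) -> torch.Tensor:
    """Scale a single latent dimension (e.g. one observed to track a style attribute)."""
    z = z.clone()
    z[:, dim] = z[:, dim] * factor
    return z


def interpolate(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linearly combine the styles of two reference utterances."""
    return (1.0 - alpha) * z_a + alpha * z_b


if __name__ == "__main__":
    z_a = torch.randn(1, 16)   # stand-ins for style vectors inferred by the VAE
    z_b = torch.randn(1, 16)
    print(scale_dimension(z_a, dim=3, factor=2.0).shape)
    print(interpolate(z_a, z_b, alpha=0.5).shape)
```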