ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054591

Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis

Abstract: We present a method to generate speech from input text and a style vector that is extracted from a reference speech signal in an unsupervised manner, i.e., no style annotation, such as speaker information, is required. Existing unsupervised methods, during training, generate speech by computing style from the corresponding ground truth sample and use a decoder to combine the style vector with the input text. Training the model in such a way leaks content information into the style vector. The decoder can use t…

Cited by 27 publications (21 citation statements). References 15 publications.
“…Mutual information (MI) measures the dependence of two random variables from an information-theoretic perspective [15, 20]. Given two random variables X and Y, the MI I(X; Y) between them is equal to the Kullback-Leibler (KL) divergence between their joint distribution, P_{X,Y}, and the product of their marginals, P_X P_Y.…”
Section: Algorithm 1 Pseudocode for MI Estimator Training (mentioning)
confidence: 99%
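For reference, the identity this quote invokes can be written out explicitly. This is the textbook definition, not notation reproduced from the cited paper:

```latex
I(X;Y) \;=\; D_{\mathrm{KL}}\!\big(P_{X,Y}\,\big\|\,P_X P_Y\big)
       \;=\; \mathbb{E}_{(x,y)\sim P_{X,Y}}\!\left[\log \frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\right].
```

The identity makes the disentanglement motivation concrete: I(X; Y) = 0 exactly when X and Y are independent, so driving the MI between two representations toward zero drives them toward independence.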
“…where M can be any function that keeps the two expectations in the above equation finite. The authors in [15] proposed using a deep neural network for M, which enables the MI between X and Y to be estimated by maximizing the lower bound in Eq. 3 with respect to M using gradient descent.…”
Section: Algorithm 1 Pseudocode for MI Estimator Training (mentioning)
confidence: 99%
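To make the quoted description concrete, here is a minimal PyTorch sketch of MINE-style estimator training under the Donsker-Varadhan lower bound the quote alludes to. Everything here is an illustrative assumption rather than the cited paper's configuration: the network architecture, dimensions, optimizer settings, and toy data are hypothetical, and the shuffled-batch marginal with its biased gradient is the textbook shortcut, not the bias-corrected variant of [15].

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """The function M in the Donsker-Varadhan bound, as a small MLP
    (hypothetical architecture, for illustration only)."""
    def __init__(self, dim_x, dim_y, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def mine_lower_bound(m_net, x, y):
    """One-minibatch estimate of the DV lower bound on I(X; Y).
    Joint term: paired (x, y) samples. Marginal term: y shuffled
    across the batch, approximating samples from P_X * P_Y."""
    joint = m_net(x, y).mean()
    y_shuffled = y[torch.randperm(y.size(0))]
    # log E[exp(M)] estimated as logsumexp - log(batch size)
    marginal = torch.logsumexp(m_net(x, y_shuffled), dim=0) - math.log(y.size(0))
    return joint - marginal

# Training loop: maximize the bound w.r.t. M by gradient ascent,
# i.e., minimize its negation.
dim_x, dim_y = 16, 16                         # illustrative dimensions
m_net = StatisticsNetwork(dim_x, dim_y)
opt = torch.optim.Adam(m_net.parameters(), lr=1e-4)
for step in range(1000):
    x = torch.randn(256, dim_x)               # stand-in for real samples
    y = x + 0.5 * torch.randn(256, dim_y)     # correlated toy pairing
    loss = -mine_lower_bound(m_net, x, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The returned value approaches I(X; Y) from below as M becomes more expressive, which is why the quote describes estimation as maximization of the bound with respect to M.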
“…Similar to GST-Tacotron, VAE-based methods also use a large dataset to train a style encoder. More recently, a mutual-information-based method was proposed in [21]. This method disentangles content and style information by minimizing the mutual information between the content vector and the style vector using a mutual information neural estimator (MINE) [22].…”
Section: Related Work (mentioning)
confidence: 99%
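The quote compresses the training scheme of [21] into one sentence; the sketch below spells out the min-max structure it implies, reusing StatisticsNetwork and mine_lower_bound from the previous sketch. The encoders (plain GRUs here), dimensions, learning rates, and random stand-in spectrograms are hypothetical placeholders, not the architecture of [21], and a real system would optimize the synthesis (reconstruction) loss alongside the MI penalty.

```python
import torch

# Hypothetical content and style encoders over mel-spectrogram frames.
content_enc = torch.nn.GRU(input_size=80, hidden_size=16, batch_first=True)
style_enc = torch.nn.GRU(input_size=80, hidden_size=16, batch_first=True)
m_net = StatisticsNetwork(dim_x=16, dim_y=16)  # from the sketch above

enc_opt = torch.optim.Adam(
    list(content_enc.parameters()) + list(style_enc.parameters()), lr=1e-4)
mi_opt = torch.optim.Adam(m_net.parameters(), lr=1e-4)

for step in range(1000):
    mel = torch.randn(32, 100, 80)       # stand-in for a spectrogram batch
    _, c = content_enc(mel)              # final hidden state: (1, 32, 16)
    _, s = style_enc(mel)
    c, s = c.squeeze(0), s.squeeze(0)    # content / style vectors: (32, 16)

    # Step 1: tighten the MI estimate by updating only the estimator
    # (encoder outputs detached so encoder weights are untouched).
    mi_est = mine_lower_bound(m_net, c.detach(), s.detach())
    mi_opt.zero_grad()
    (-mi_est).backward()
    mi_opt.step()

    # Step 2: update the encoders to *reduce* the estimated MI, so the
    # style vector carries as little content information as possible.
    mi_penalty = mine_lower_bound(m_net, c, s)
    enc_opt.zero_grad()
    mi_penalty.backward()
    enc_opt.step()
```

Alternating these two steps is the adversarial pattern the quote names: the estimator chases a tight lower bound on I(content; style) while the encoders are penalized by whatever MI the estimator can still find.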
“…In the case of speech processing, an ideal disentangled representation would be able to separate fine-grained factors such as speaker identity, noise, recording channels, and prosody [22], as well as the linguistic content. Disentanglement thus allows learning salient and robust representations from speech that are essential for applications including speech recognition [64], prosody transfer [77, 87], speaker verification [66], speech synthesis [31, 77], and voice conversion [32].…”
Section: Learning Disentangled Representation (mentioning)
confidence: 99%