ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414226

Disentangled Speaker and Language Representations Using Mutual Information Minimization and Domain Adaptation for Cross-Lingual TTS

Abstract: We propose a method for obtaining disentangled speaker and language representations via mutual information minimization and domain adaptation for cross-lingual text-to-speech (TTS) synthesis. The proposed method extracts speaker and language embeddings from acoustic features with a speaker encoder and a language encoder. It then applies domain adaptation to the two embeddings to obtain a language-invariant speaker embedding and a speaker-invariant language embedding. To get more disentangled represe…
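The domain-adaptation step the abstract describes is commonly implemented with a gradient reversal layer: the encoder feeds an adversarial language classifier, and the gradient flowing back from that classifier into the encoder is negated, so the encoder learns a language-invariant speaker embedding. The abstract does not give the exact formulation, so the NumPy sketch below is illustrative only; all dimensions, weight names, and the reversal strength `lam` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names and sizes hypothetical): 8-dim "acoustic" input x,
# 4-dim speaker embedding h = x @ W_enc, and an adversarial language
# classifier producing softmax probabilities over 2 languages.
W_enc = rng.normal(size=(8, 4)) * 0.1
W_cls = rng.normal(size=(4, 2)) * 0.1
x = rng.normal(size=(16, 8))
lang = rng.integers(0, 2, size=16)      # language labels for the batch
lam = 1.0                               # gradient-reversal strength (assumed)

# Forward pass
h = x @ W_enc                           # speaker embedding
logits = h @ W_cls
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)

# Backward pass: cross-entropy gradient w.r.t. logits
d_logits = p.copy()
d_logits[np.arange(16), lang] -= 1.0
d_logits /= 16

g_cls = h.T @ d_logits                  # classifier learns to predict language
g_h = d_logits @ W_cls.T                # gradient arriving at the embedding
g_enc = x.T @ (-lam * g_h)              # gradient reversal: encoder is pushed
                                        # to REMOVE language information

lr = 0.1
W_cls -= lr * g_cls                     # classifier descends its loss
W_enc -= lr * g_enc                     # encoder ascends it (adversarial)
```

The single negation `-lam * g_h` is the entire mechanism: the layer is the identity in the forward pass and flips the gradient in the backward pass.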

Cited by 21 publications (14 citation statements); references 16 publications.
“…All the texts are converted into IPA symbols via espeak. 80-dimensional mel-spectrograms are extracted using a Hanning window with a frame shift of 10 ms and a frame length of 42.7 ms. The Kaldi toolkit [18] is then used for forced alignment.…”
Section: Methods
confidence: 99%
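The feature configuration quoted above (80 mel bins, Hanning window, 10 ms frame shift, 42.7 ms frame length) can be reproduced with a plain-NumPy mel-spectrogram. This is a minimal sketch, not the citing paper's pipeline: the 16 kHz sampling rate and `n_fft = 1024` are assumptions, since the excerpt does not state them.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=80):
    # Triangular mel filters spanning 0 .. sr/2
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y, sr=16000, n_mels=80, frame_ms=42.7, shift_ms=10.0):
    win_length = int(round(sr * frame_ms / 1000))  # ~683 samples at 16 kHz
    hop_length = int(round(sr * shift_ms / 1000))  # 160 samples at 16 kHz
    n_fft = 1024                                   # assumed FFT size
    window = np.hanning(win_length)
    n_frames = 1 + max(len(y) - win_length, 0) // hop_length
    spec = np.empty((n_fft // 2 + 1, n_frames))
    for t in range(n_frames):
        frame = y[t * hop_length : t * hop_length + win_length] * window
        spec[:, t] = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2  # power
    return mel_filterbank(sr, n_fft, n_mels) @ spec  # shape (80, n_frames)
```

One second of 16 kHz audio yields 96 frames at this hop and window size; in practice a library such as librosa would be used instead of the hand-rolled STFT.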
“…Building a cross-lingual TTS system is a task that requires the system to synthesize speech in a language foreign to the target speaker [1]. It is straightforward to build a multi-lingual TTS system from a multi-lingual corpus using end-to-end models [2,3,4], yet such a corpus from a single speaker is often difficult to collect.…”
Section: Introduction
confidence: 99%
“…Here, reconstruction loss is used to ensure that the autoencoder architecture does not lose too much information. Recent studies [22], [23], [24], [25], [26], [27] have demonstrated that mutual information minimization is an effective method for extracting disentangled representations in various style transfer tasks. To achieve better disentanglement of the emotion representations and emotion-independent representations of the input speech, we incorporate mutual information minimization into the autoencoder training process.…”
Section: Introduction
confidence: 99%
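Mutual-information minimization as a disentanglement loss is easiest to see in the Gaussian case, where MI has a closed form: for two unit-variance jointly Gaussian variables with correlation ρ, I(X;Y) = -½ log(1 - ρ²), so driving an MI penalty toward zero decorrelates the two representations. The toy below uses this analytic formula purely for illustration; the methods cited above replace it with a learned neural estimate (e.g. CLUB- or MINE-style bounds).

```python
import numpy as np

def gaussian_mi(rho):
    # Analytic mutual information (in nats) between two unit-variance
    # jointly Gaussian variables with correlation coefficient rho.
    return -0.5 * np.log(1.0 - rho ** 2)

# A disentanglement objective adds an MI estimate as a penalty term;
# minimizing it shrinks the statistical dependence between the two
# embeddings. For Gaussians the effect is exact: MI -> 0 as rho -> 0.
for rho in (0.9, 0.5, 0.1, 0.0):
    print(f"rho={rho:.1f}  I(X;Y)={gaussian_mi(rho):.4f} nats")
```

Note that MI, unlike plain correlation, also captures nonlinear dependence, which is why the cited works estimate it with a neural critic rather than a covariance statistic.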
“…However, such methods have largely remained absent from work in the speech domain, particularly in the field of automatic speech recognition. Recently, Xin et al [13] and Gong et al [14] demonstrated the benefits of disentangled representations when generating speech from text, Wang et al [15] showed the same for voice conversion, while Kwon et al [4] demonstrated gains in speaker recognition. Thus, it is clear that explicitly disentangled representations have the potential to improve generalization in a wide range of speech-related problems.…”
Section: Introduction
confidence: 99%