Interspeech 2021
DOI: 10.21437/interspeech.2021-474
Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information

Cited by 5 publications (6 citation statements). References 19 publications.
“…The attention module is shared by both streams to keep inter-stream synchrony. Following previous work [13], a location-sensitive attention mechanism [17] is adopted here to align text encoder outputs to acoustic feature sequences. The predicted mel-cepstra, logF0 and energy at the previous frame are passed to the prenet.…”
Section: Model Architecture
confidence: 99%
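The location-sensitive attention quoted above scores each encoder position with an additive energy that also sees a convolution of the previous step's alignment. The sketch below is a minimal NumPy illustration; the shapes, the single smoothing filter, and all variable names are assumptions for demonstration, not the exact configuration of [17].

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def location_sensitive_attention(query, keys, prev_align, W, V, U, conv_filter, v):
    # Location features: convolve the previous step's alignment so each
    # position's score can see where the model attended last frame.
    loc = np.convolve(prev_align, conv_filter, mode="same")        # (T,)
    # Additive (Bahdanau-style) energy with the extra location term.
    energy = np.tanh(query @ W + keys @ V + np.outer(loc, U)) @ v  # (T,)
    align = softmax(energy)          # attention weights over encoder steps
    context = align @ keys           # weighted sum of encoder outputs
    return context, align

# Toy usage with assumed sizes.
rng = np.random.default_rng(0)
T, d_k, d_q, d_a = 6, 8, 4, 10
keys = rng.normal(size=(T, d_k))          # text-encoder outputs
query = rng.normal(size=(d_q,))           # decoder state
prev_align = softmax(rng.normal(size=T))  # previous attention weights
W, V = rng.normal(size=(d_q, d_a)), rng.normal(size=(d_k, d_a))
U, v = rng.normal(size=(d_a,)), rng.normal(size=(d_a,))
conv_filter = np.ones(3) / 3.0            # one smoothing filter (assumption)
context, align = location_sensitive_attention(
    query, keys, prev_align, W, V, U, conv_filter, v)
```

Because the previous alignment feeds the score, the mechanism is biased toward moving monotonically along the text, which is why it suits both acoustic streams sharing one module.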
“…Following the previous work [13], an adversarial speaker classifier [18] with a gradient reversal layer is applied to the concatenated encoder output. It follows the principle of domain adversarial training [19] to remove the residual speaker information in the encoder output.…”
Section: Model Architecture
confidence: 99%
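A gradient reversal layer of the kind cited above [18, 19] is the identity in the forward pass and negates (optionally scales) the gradient in the backward pass, so training the speaker classifier pushes speaker information out of the encoder output. A minimal NumPy sketch of the two passes, with the scaling factor `lam` as an assumption:

```python
import numpy as np

def grad_reverse_forward(x):
    # Forward pass is the identity: features reach the speaker
    # classifier unchanged.
    return x

def grad_reverse_backward(grad_output, lam=1.0):
    # Backward pass flips (and scales) the gradient, so descending the
    # speaker-classifier loss *ascends* it w.r.t. the encoder, removing
    # residual speaker information from the encoder output.
    return -lam * grad_output

feats = np.array([0.5, -1.2, 3.0])
grad = np.array([0.1, 0.2, -0.3])
out = grad_reverse_forward(feats)            # identical to feats
back = grad_reverse_backward(grad, lam=0.5)  # [-0.05, -0.1, 0.15]
```

In an autodiff framework this is typically packaged as a custom op with exactly these two rules, placed between the encoder and the adversarial classifier.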
See 1 more Smart Citation
“…Based on the disentanglement strategy, the existing cross-lingual approaches can be roughly divided into implicit-based and explicit-based methods [9]. Implicit-based methods mainly study unified linguistic/phonetic representations across languages to disentangle language and speaker timbre implicitly [11], [12], [13], [14], [15], [16]. On the other hand, to further solve the foreign accent problem, explicit-based methods prefer to adopt adversarial learning [1], [7], [9], [17] or mutual information [6] to minimize the correlation between different speech factors, thus encouraging the model to automatically learn disentangled linguistic representations.…”
Section: Introduction
confidence: 99%
“…[6,7] and [8,9] use Unicode bytes and Phonetic Posterior-Grams (PPGs), respectively, as a common phonetic set to build cross-lingual TTS systems, and both obtain improvements over their baselines. Moreover, [10,11] convert all the graphemes of different languages into the same International Phonetic Alphabet (IPA) set to facilitate cross-lingual modeling, and the experiments in [11] show the advantage of IPA over language-dependent phonemes.…”
Section: Introduction
confidence: 99%
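The shared-IPA idea above can be illustrated with a toy lookup that folds language-dependent phoneme labels into one IPA inventory. The mapping below is a small illustrative assumption for demonstration, not the inventories actually used in [10, 11]:

```python
# Illustrative mapping from (language, native phoneme label) to a
# shared IPA symbol; entries are assumptions for demonstration only.
TO_IPA = {
    ("en", "SH"): "ʃ",   # ARPAbet SH
    ("en", "AA"): "ɑ",   # ARPAbet AA
    ("de", "sch"): "ʃ",  # German orthographic "sch"
}

def unify(lang, phones):
    # Fall back to the native label when no shared symbol is known.
    return [TO_IPA.get((lang, p), p) for p in phones]

english = unify("en", ["SH", "AA"])  # ['ʃ', 'ɑ']
german = unify("de", ["sch"])        # ['ʃ']
```

English "SH" and German "sch" collapse onto the same symbol ʃ, which is what lets a single model share acoustic realizations across languages instead of learning each language's label set in isolation.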