Interspeech 2021
DOI: 10.21437/interspeech.2021-897
Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis

Cited by 7 publications (3 citation statements)
References 8 publications
“…[12] narrows the gap by introducing multiple reference encoders that encode both the synthesized and ground-truth (GT) acoustic features, allowing gradients to flow through the synthesized ones to cover unseen cases in training. In [13], a consistency loss over speaker identity is applied by computing the distance between speaker embeddings of the synthesized and the GT mel-spectrograms, which helps to keep the target speaker identity after the cross-lingual transfer.…”
Section: Introduction (mentioning, confidence: 99%)
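The speaker consistency loss described in the quote above can be sketched as a distance between two speaker embeddings. This is not code from the cited papers; it is a minimal numpy sketch assuming a cosine-distance formulation, with hypothetical names (`emb_synth`, `emb_gt`) standing in for embeddings extracted from the synthesized and ground-truth mel-spectrograms by a speaker encoder.

```python
import numpy as np

def speaker_consistency_loss(emb_synth, emb_gt):
    """Cosine-distance sketch of a speaker consistency loss: penalize
    mismatch between the speaker embedding of the synthesized
    mel-spectrogram and that of the ground-truth one.
    (Hypothetical formulation; the cited paper may use a different metric.)"""
    cos = np.dot(emb_synth, emb_gt) / (
        np.linalg.norm(emb_synth) * np.linalg.norm(emb_gt)
    )
    # Loss is 0 when the embeddings point in the same direction,
    # and grows as the synthesized speaker drifts from the target.
    return 1.0 - cos

# Identical embeddings give (near-)zero loss; orthogonal ones give 1.
e = np.array([0.5, 1.0, -0.2])
print(speaker_consistency_loss(e, e))
```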
“…Inspired by [13] and [12], we employ the triplet loss [14] and propose a triplet training scheme to cover unseen cases. A triplet is composed of an anchor, a positive, and a negative sample.…”
Section: Introduction (mentioning, confidence: 99%)
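The standard margin-based triplet loss referenced above ([14]) can be sketched as follows. This is a generic numpy illustration, not the citing paper's implementation; the margin value and Euclidean distance are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss: pull the anchor toward the positive
    sample and push it away from the negative sample until the negative
    is at least `margin` farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # Zero loss once the triplet constraint d_neg >= d_pos + margin holds.
    return max(d_pos - d_neg + margin, 0.0)

# A well-separated triplet incurs no loss; a tight one is penalized.
anchor = np.zeros(2)
print(triplet_loss(anchor, np.array([0.0, 1.0]), np.array([3.0, 0.0])))
print(triplet_loss(anchor, np.array([0.0, 1.0]), np.array([1.0, 0.0])))
```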
“…[17] proposed to train a multi-lingual speech emotion recognition model with adversarial domain adaptation. [18] introduced a speaker-independent speech encoder with DAT to further assist in text-to-speech synthesis. [19,20] proposed to apply DAT on speaker recognition models to address the domain mismatch problem.…”
Section: Introduction (mentioning, confidence: 99%)
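The domain adversarial training (DAT) mentioned in the quote above is commonly built around a gradient-reversal layer: identity in the forward pass, negated (and scaled) gradient in the backward pass, so the feature encoder learns representations that confuse the domain classifier. The sketch below shows just that layer's two passes in plain numpy; the function names and the scaling factor `lam` are illustrative, not from the cited works.

```python
import numpy as np

def grl_forward(x):
    """Gradient-reversal layer, forward pass: the identity function."""
    return x

def grl_backward(grad_output, lam=1.0):
    """Gradient-reversal layer, backward pass: flip the sign of the
    incoming gradient and scale it by lam, so the encoder ascends the
    domain classifier's loss while the classifier descends it."""
    return -lam * grad_output

# Forward leaves features untouched; backward reverses the gradient.
x = np.array([1.0, -2.0])
print(grl_forward(x))
print(grl_backward(x, lam=0.5))
```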