2020
DOI: 10.1109/access.2020.3027619

Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

Abstract: This work applies hierarchical transfer learning to implement deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based systems typically require a large amount of training data. In recent years, while DNN-based TTS has achieved remarkable results for high-resource languages, it still suffers from data scarcity for low-resource languages. In this paper, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages. We …
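The multi-stage transfer learning idea in the abstract — pretrain on a high-resource language, then fine-tune progressively on the low-resource target while freezing earlier components at each stage — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation; the parameter-group names, stage order, and update rule are all assumed for the example.

```python
# Toy sketch of multi-stage (hierarchical) transfer learning for a TTS model.
# Parameter groups, stages, and the "gradient step" are illustrative only.

def train_stage(params, trainable, lr=0.1, steps=5):
    """Toy 'training': update only the trainable parameter groups;
    frozen groups are left untouched."""
    for _ in range(steps):
        for group, value in params.items():
            if group in trainable:
                params[group] = value - lr * value  # stand-in for a gradient step
    return params

# Parameter groups of a hypothetical TTS model.
model = {"text_encoder": 1.0, "acoustic_decoder": 1.0, "speaker_embedding": 1.0}

# Stage 1: pretrain all groups on a high-resource language.
train_stage(model, trainable={"text_encoder", "acoustic_decoder", "speaker_embedding"})

# Stage 2: adapt to the low-resource language; freeze the text encoder.
encoder_before = model["text_encoder"]
train_stage(model, trainable={"acoustic_decoder", "speaker_embedding"})
assert model["text_encoder"] == encoder_before  # frozen group is unchanged
```

The point of the staging is that knowledge captured in early stages (e.g., a text encoder trained on abundant data) is preserved while later stages adapt the remaining components to the scarce target-language data.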

Cited by 21 publications (19 citation statements)
References 39 publications
“…Work [41] proposed a cross-lingual mapping-based approach from high-resource language domains. In contrast to these approaches, our study proposes deep transfer learning to train a DNN-based TTS model, since it is the most widely used technique in low-resource NLP, including in our previous works [28], [19].…”
Section: A. NLP for Low-Resource
confidence: 99%
“…This is true not only for single-speaker TTS; multi-speaker TTS systems also show outstanding results. However, many studies of multi-speaker TTS focus only on producing speech utterances from target speakers seen in the training data [12], [13], [14], [15], [16], [17], [18], including our previous work [19]. Extending speaker adaptation so that the model can synthesize speech from target speakers not seen during training remains challenging.…”
Section: Introduction
confidence: 99%
“…These resulting values (n = 880) were used for analysis.
[6], [7], [8], [9], [10], [11], [12]
Hidden Markov Model synthesis (HMM): 7 studies, [12], [13], [14], [15], [16], [17], [18]
Neural network (non-S2S) synthesis (DNN): 9 studies, [19], [20], [21], [22], [23], [24], [25], [26], [27]
Sequence-to-sequence synthesis (S2S)…”
Section: Characteristics of the Included Studies
confidence: 99%
“…Intelligibility [23]
WER (Word Error Rate): Intelligibility [20]
MOS (Mean Opinion Score): Naturalness/Quality [6], [8], [9], [14], [15], [16], [17], [18], [21], [23]
A/B Preference (preference rate between test & control): Quality [5], [10], [11], [12], [13]…”
Section: Multilingual Model Effect (MLME)
confidence: 99%