This work applies hierarchical transfer learning to implement deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based systems typically require a large amount of training data. In recent years, while DNN-based TTS has achieved remarkable results for high-resource languages, it still suffers from data scarcity for low-resource languages. In this paper, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages. We make use of a high-resource language and a joint multilingual dataset of low-resource languages. A monolingual TTS pre-trained on the high-resource language is fine-tuned on the low-resource language using the same model architecture. Then, we apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS, and finally from the pre-trained multilingual TTS to a multilingual TTS with style transfer. Our experiments on Indonesian, Javanese, and Sundanese show adequate quality of the synthesized speech. The evaluation of our multilingual TTS reaches a mean opinion score (MOS) of 4.35 for Indonesian (ground truth = 4.36), while for Javanese and Sundanese it reaches MOS values of 4.20 (ground truth = 4.38) and 4.28 (ground truth = 4.20), respectively. For the parallel style transfer evaluation, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively. The results indicate that the proposed strategy can be effectively applied to the low-resource language target domain. With a small amount of training data, our models are able to learn step by step from a smaller TTS network to larger ones, produce intelligible speech approaching a real human voice, and successfully transfer speaking style from a reference audio.

INDEX TERMS deep neural network, hierarchical transfer learning, low-resource, multi-speaker, multilingual, style transfer, text-to-speech
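
The "partial network-based transfer learning" step described above amounts to initializing the layers that a larger model shares with a smaller pre-trained one, while leaving the new components randomly initialized. The snippet below is a minimal illustrative sketch of that idea, not the paper's actual code: the MonolingualTTS and MultilingualTTS classes, their layer names, and the partial_transfer helper are hypothetical stand-ins assuming a PyTorch implementation.

```python
# Minimal sketch of partial network-based transfer learning (assumes PyTorch).
# MonolingualTTS / MultilingualTTS are toy stand-ins, not the paper's models.
import torch.nn as nn


class MonolingualTTS(nn.Module):
    """Toy stand-in for the pre-trained monolingual model (assumption)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Linear(128, hidden)
        self.decoder = nn.Linear(hidden, 80)  # e.g. 80-dim mel frames


class MultilingualTTS(nn.Module):
    """Toy stand-in for the larger multilingual model (assumption)."""
    def __init__(self, hidden=256, num_langs=3):
        super().__init__()
        self.encoder = nn.Linear(128, hidden)
        self.decoder = nn.Linear(hidden, 80)
        # Extra component with no counterpart in the monolingual network,
        # e.g. a language embedding; it keeps its random initialization.
        self.lang_embedding = nn.Embedding(num_langs, hidden)


def partial_transfer(src: nn.Module, dst: nn.Module) -> None:
    """Copy every parameter whose name and shape match; leave the rest."""
    src_state = src.state_dict()
    dst_state = dst.state_dict()
    shared = {k: v for k, v in src_state.items()
              if k in dst_state and v.shape == dst_state[k].shape}
    dst_state.update(shared)
    dst.load_state_dict(dst_state)


mono = MonolingualTTS()        # pre-trained on the high-resource language
multi = MultilingualTTS()      # next stage in the hierarchy
partial_transfer(mono, multi)  # shared layers start from mono's weights
```

The same pattern would apply at the final stage, transferring the pre-trained multilingual weights into the style-transfer model before fine-tuning on the low-resource data.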