Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora

Yamagishi, Junichi; Usabaev, Bela; King, Simon; Watts, Oliver; Dines, John; Tian, Jilei; Guan, Yong; Hu, Rong; Oura, Keiichiro; Wu, Yi-Jian; Tokuda, Keiichi; Karhila, Reima; Kurimo, Mikko

doi:10.1109/tasl.2010.2045237

Cited by 68 publications

(43 citation statements)

References 42 publications

(53 reference statements)

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…F0 and F1 values for each of the 60 speakers were calculated at the midpoint of each vowel and we took the mean over all vowel tokens (once again using the Snack toolkit). We used the synthetic speech from Yamagishi et al (2010) for the 60 speakers. We also calculated F0 and F1 values for speaker 001's English synthetic speech and Japanese synthetic speech which was achieved by cross-lingual speaker adaptation based on his English speech data (2000 adaptation sentences).…”

Section: Speech Materials For Kld Analysismentioning

confidence: 99%

See 1 more Smart Citation

Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping

Oura¹,

Yamagishi²,

Wester³

et al. 2012

Speech Communication

Self Cite

View full text Add to dashboard Cite

, K 2012, 'Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping ' Speech Communication, vol. 54, no. 6, pp. 703-714. DOI: 10.1016/j.specom.2011 General rightsCopyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights. Take down policyThe University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately and investigate your claim. AbstractIn the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state-mapping on only the first 10 mel-cepstral coefficients leads to huge savings in computational costs, without any detrimental effect on the quality of the synthetic speech.

show abstract

Section: Speech Materials For Kld Analysismentioning

confidence: 99%

“…and the JNAS database for Japanese (Itou et al, 1998). Details of the front-end text processing used to derive phonetic-prosodic labels from the word transcriptions can be found in Yamagishi et al (2010).…”

Section: Introductionmentioning

confidence: 99%

Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping

Oura¹,

Yamagishi²,

Wester³

et al. 2012

Speech Communication

Self Cite

View full text Add to dashboard Cite

show abstract

“…Nowadays two are the main speech processing techniques that allow the creation of synthetic speech spoofing signals: First, the statistical speech synthesizers (Yoshimura et al, 1999) (Tokuda et al, 2002) using voices adapted to a particular speaker (Yamagishi et al, 2009) even with minimum quality material (Yamagishi et al, 2010). Second, the voice conversion (VC) techniques (Jin et al, 2008), (Kinnunen et al, 2012).…”

Section: Introductionmentioning

confidence: 99%

Synthetic speech detection using phase information

Saratxaga

Sánchez

et al. 2016

Speech Communication

View full text Add to dashboard Cite

Taking advantage of the fact that most of the speech processing techniques neglect the phase information, we seek to detect phase perturbations in order to prevent synthetic impostors attacking Speaker Verification systems. Two Synthetic Speech Detection (SSD) systems that use spectral phase related information are reviewed and evaluated in this work: one based on the Modified Group Delay (MGD), and the other based on the Relative Phase Shift, (RPS). A classical modulebased MFCC system is also used as baseline. Different training strategies are proposed and evaluated using both real spoofing samples and copy-synthesized signals from the natural ones, aiming to alleviate the issue of getting real data to train the systems. The recently published ASVSpoof2015 database is used for training and evaluation. Performance with completely unrelated data is also checked using synthetic speech from the Blizzard Challenge as evaluation material. The results prove that phase information can be successfully used for the SSD task even with unknown attacks.

show abstract

“…The HMM-based speech synthesis systems have the ability to synthesize speech with a high degree of naturalness comparable to state-of-the-art unit selection systems [3]. The concept was first proposed by Yoshimura et al, [4] and was developed for languages such as Japanese, English, Thai, Romanian, Mandarin, Korean, Austria, Portuguese, Arabic, Hungarian and German among others [5]- [15].…”

Section: Introductionmentioning

confidence: 99%

Developing an HMM-Based Speech Synthesis System for Malay: A Comparison of Iterative and Isolated Unit Training

Mustafa

Don

Ainon

et al. 2014

IEICE Trans. Inf. & Syst.

View full text Add to dashboard Cite

SUMMARYThe development of an HMM-based speech synthesis system for a new language requires resources like speech database and segment-phonetic labels. As an under-resourced language, Malay lacks the necessary resources for the development of such a system, especially segment-phonetic labels. This research aims at developing an HMM-based speech synthesis system for Malay. We are proposing the use of two types of training HMMs, which are the benchmark iterative training incorporating the DAEM algorithm and isolated unit training applying segmentphonetic labels of Malay. The preferred method for preparing segmentphonetic labels is the automatic segmentation. The automatic segmentation of Malay speech database is performed using two approaches which are uniform segmentation that applies fixed phone duration, and a crosslingual approach that adopts the acoustic model of English. We have measured the segmentation error of the two segmentation approaches to ascertain their relative effectiveness. A listening test was used to evaluate the intelligibility and naturalness of the synthetic speech produced from the iterative and isolated unit training. We also compare the performance of the HMM-based speech synthesis system with existing Malay speech synthesis systems. key words: iterative training, isolated unit training, cross lingual approach, uniform segmentation, segment-phonetic labels

show abstract

Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora

Cited by 68 publications

References 42 publications

Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping

Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping

Synthetic speech detection using phase information

Developing an HMM-Based Speech Synthesis System for Malay: A Comparison of Iterative and Isolated Unit Training

Contact Info

Product

Resources

About