Speaker similarity evaluation of foreign-accented speech synthesis using HMM-based speaker adaptation

Wester, Mirjam; Karhila, Reima

doi:10.1109/icassp.2011.5947572

Cited by 13 publications

(22 citation statements)

References 10 publications


“…These findings give a good basis to further explore the behaviour of listeners in S2ST system evaluations. Preliminary experiments investigating various aspects of listeners' behaviour on synthetic speech in a S2ST context can be found in Wester and Karhila (2011); Karhila and Wester (2011); Wester and Liang (2011a).…”

Section: Discussion (mentioning)

confidence: 99%

Talker discrimination across languages

Wester

2012

Speech Communication

Self Cite

This study investigated the extent to which listeners are able to discriminate between bilingual talkers in three language pairs -English-German, English-Finnish and English-Mandarin. Native English listeners were presented with two sentences spoken by bilingual talkers and were asked to judge whether they thought the sentences were spoken by the same person. Equal amounts of cross-language and matched-language trials were presented. The results show that native English listeners are able to carry out this task well; achieving percent correct levels at well above chance for all three language pairs. Previous research has shown this for English-German, this research shows listeners also extend this to Finnish and Mandarin, languages that are quite distinct from English from a genetic and phonetic similarity perspective. However, listeners are significantly less accurate on cross-language talker trials (English-foreign) than on matched-language trials (English-English and foreign-foreign). Understanding listeners' behaviour in cross-language talker discrimination using natural speech is the first step in developing principled evaluation techniques for synthesis systems in which the goal is for the synthesised voice to sound like the original speaker, for instance, in speech-to-speech translation systems, voice conversion and reconstruction.

“…In parallel with the research presented in this paper, other research has been investigating the above issues. For more details, please refer to Wester (2010); Wester and Karhila (2011); Tsuzaki et al (2011).…”

Section: Discussion (mentioning)

confidence: 99%

“…As references for judging the degree of speaker similarity of the synthetic speech to the original speaker, we used natural speech. However, it has been shown that there is a significant degradation in a listener's ability to decide on speaker similarity when comparing natural and synthetic speech stimuli (Wester and Karhila, 2011). The task here is further made more complex by requiring the listeners to rate speaker similarity across languages.…”

Section: Number Of Adaptation Sentences (mentioning)

confidence: 99%

Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping

Oura, Yamagishi, Wester et al.

2012

Speech Communication

Self Cite

Speech Communication, vol. 54, no. 6, pp. 703-714.

In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state-mapping on only the first 10 mel-cepstral coefficients leads to huge savings in computational costs, without any detrimental effect on the quality of the synthetic speech.

“…A well-known method for data augmentation is speaker adaptation, where the most common approach is to build an average voice model of multiple speakers and then adapt a model for new (target) speaker from it. Speaker adaptation is a well-researched topic in HMM-based speech synthesis [4,5,6,7,8,9] but still relatively unexplored for DNN-based synthesis. Arik et al [10] found that speaker adaptation by fine-tuning (i.e.…”

Section: Data Augmentation (mentioning)

confidence: 99%

Data Requirements, Selection and Augmentation for DNN-based Speech Synthesis from Crowdsourced Data

2018

Crowdsourcing speech recordings provides unique opportunities and challenges for personalized speech synthesis as it allows gathering of large quantities of data but with a huge variety in quality. Manual methods for data selection and cleaning quickly become infeasible, especially when producing larger quantities of voices. We present and analyze approaches for data selection and augmentation to cope with this. For differently-sized training sets, we assess speaker adaptation by transfer learning, including layer freezing, and sentence selection using maximum likelihood of forced alignment. The methodological framework utilizes statistical parametric speech synthesis based on Deep Neural Networks (DNNs). We compare objective scores for 576 voice models, representing all condition combinations. For a constrained set of conditions we also present results from a subjective listening test. We show that speaker adaptation improves overall quality in nearly all cases, sentence selection helps detecting recording errors, and layer freezing proves to be ineffective in our system. We also found that while Mel-Cepstral Distortion (MCD) does not correlate with listener preference across the range of values, the most preferred voices also exhibited the lowest values for MCD. These findings have implications on scalable methods of customized voice building and clinical applications with sparse data.
