Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping

Oura, Keiichiro; Yamagishi, Junichi; Wester, Mirjam; King, Simon; Tokuda, Keiichi

doi:10.1016/j.specom.2011.12.004

Cited by 16 publications

(10 citation statements)

References 45 publications

“…In the ML-based eigenvoice approach, given some adaptation data $\chi_a = \{x^{(1)}, x^{(2)}, \ldots, x^{(N_{o,s})}\}$, $N_{o,s}$ is the total number of observations from speaker $s$, the likelihood function…”

Section: Eigenvoice Adaptation

mentioning

confidence: 99%

Cross-Lingual Speaker Adaptation for Statistical Speech Synthesis Using Limited Data

Sarfjoo

Demiroğlu

2016

Interspeech 2016

Cross-lingual speaker adaptation with limited adaptation data has many applications such as use in speech-to-speech translation systems. Here, we focus on cross-lingual adaptation for statistical speech synthesis (SSS) systems using limited adaptation data. To that end, we propose two techniques exploiting a bilingual Turkish-English speech database that we collected. In one approach, speaker-specific state-mapping is proposed for cross-lingual adaptation which performed significantly better than the baseline state-mapping algorithm in adapting the excitation parameter both in objective and subjective tests. In the second approach, eigenvoice adaptation is done in the input language which is then used to estimate the eigenvoice weights in the output language using weighted linear regression. The second approach performed significantly better than the baseline system in adapting the spectral envelope parameters both in objective and subjective tests.

“…Cross-lingual speaker adaptation (CLSA) for statistical speech synthesis is used for adapting to a target speaker in an output language, using adaptation data from the speaker in an input language. CLSA algorithms have many applications such as deployment in speech-to-speech translation systems [1,2].…”

Section: Introduction

mentioning

confidence: 99%

“…Cross-lingual speaker adaptation (CLSA) for statistical speech synthesis is a method for adapting a text-to-speech (TTS) system for a desired output language, given adaptation data (i.e., speech) from the target speaker in a different input language. Applications include speech-to-speech translation [1], [2].…”

Section: Introduction

mentioning

confidence: 99%

Using Eigenvoices and Nearest-Neighbors in HMM-Based Cross-Lingual Speaker Adaptation With Limited Data

Sarfjoo

Demiroğlu

King

2017

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

Cross-lingual speaker adaptation for speech synthesis has many applications, such as use in speech-to-speech translation systems. Here, we focus on cross-lingual adaptation for statistical speech synthesis systems using limited adaptation data. To that end, we propose two eigenvoice adaptation approaches exploiting a bilingual Turkish-English speech database that we collected. In one approach, eigenvoice weights extracted using Turkish adaptation data and Turkish voice models are transformed into the eigenvoice weights for the English voice models using linear regression. Weighting the samples depending on the distance of reference speakers to target speakers during linear regression was found to improve the performance. Moreover, importance weighting the elements of the eigenvectors during regression further improved the performance. The second approach proposed here is speaker-specific state-mapping which performed significantly better than the baseline state-mapping algorithm both in objective and subjective tests. Performance of the proposed state mapping algorithm was further improved when it was used with the intra-lingual eigenvoice approach instead of the linear-regression based algorithms used in the baseline system.
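As a rough illustration of the regression step described in the abstract above, the sketch below maps eigenvoice weights estimated in the input language to weights in the output language using sample-weighted least squares. The function names, the Gaussian distance weighting, the bandwidth, and the small ridge term are assumptions made for illustration; they are not details taken from the paper.

```python
import numpy as np


def distance_weights(ref_inputs, target_input, bandwidth=1.0):
    """Hypothetical weighting: reference speakers closer to the target
    (in input-language eigenvoice space) get larger weights."""
    ref = np.asarray(ref_inputs, dtype=float)
    tgt = np.asarray(target_input, dtype=float)
    sq_dist = np.sum((ref - tgt) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def fit_weighted_regression(X, Y, sample_weights, ridge=1e-6):
    """Fit Y ~ [X, 1] @ B by weighted least squares.

    X: (n_speakers, d_in) input-language eigenvoice weights
    Y: (n_speakers, d_out) output-language eigenvoice weights
    sample_weights: (n_speakers,) non-negative weights
    ridge: small stabiliser added to the normal equations (assumption)
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(sample_weights, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    XtW = Xb.T * w  # scales each sample's column by its weight
    A = XtW @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, XtW @ Y)


def map_weights(x, B):
    """Map one input-language weight vector to the output language."""
    return np.append(np.asarray(x, dtype=float), 1.0) @ B
```

A typical call would compute `distance_weights` for the target speaker against the bilingual reference speakers, fit `B` with those weights, and then apply `map_weights` to the target speaker's input-language eigenvoice weights.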

“…For example, one-to-many Gaussian Mixture Model (GMM)-based voice conversion can be applied to unsupervised speaker adaptation in cross-lingual speech synthesis [11], [12]. In addition, cross-lingual adaptation parameter mapping [13]-[15] and cross-lingual frame mapping [16] have also been proposed for HMM-based speech synthesis. These approaches use a non-native speaker's natural voice in his/her mother tongue to extract speaker-dependent acoustic characteristics and make it possible to synthesize naturally sounding target language voices.…”

mentioning

confidence: 99%

Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics

Oshima

Takamichi

Toda

et al. 2016

IEICE Trans. Inf. & Syst.

SUMMARY: This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign language speech using a target speaker's natural speech uttered in his/her mother tongue. Although the technique holds promise to improve a wide variety of applications, it tends to cause degradation of target speaker's individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to speech synthesis that preserves speaker individuality by using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve the speaker individuality in the synthesized target speech, naturalness is significantly degraded as the synthesized speech waveform is directly affected by unnatural prosody and pronunciation often caused by differences in the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosody correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. The experimental results using English speech uttered by native Japanese speakers demonstrate that (1) the proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech, and (2) the proposed methods also improve intelligibility as confirmed by a dictation test.

Key words: cross-lingual speech synthesis, English-Read-by-Japanese, speaker individuality, HMM-based speech synthesis, prosody correction, phonetic correction
