Building Personalized Synthetic Voices for Individuals with Dysarthria using the HTS Toolkit

Creer, Sarah; Green, Phil; Cunningham, Stuart; Yamagishi, Junichi

doi:10.4018/978-1-61520-725-1.ch006

Cited by 7 publications

(8 citation statements)

References 26 publications


“…Platform for medical voice banking: These voices may be used as a platform for medical voice banking. In [67], the HTS framework was used as personalized synthetic voices for patients who have dysarthria and thus require TTS systems as communication aids. The patients can choose the most similar voice from a wide variety of voices.…”

Section: Discussion (mentioning)

confidence: 99%

Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora

Yamagishi

Usabaev

King

et al. 2010

IEEE Trans. Audio Speech Lang. Process.

Self Cite


In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases.



“…This method [9] starts with a speaker-independent model, or "average voice model", learned over multiple speakers and uses model adaptation techniques drawn from speech recognition such as maximum likelihood linear regression (MLLR), to adapt the speaker independent model to a new speaker. It has been shown that using 100 sentences or approximately 6-7 minutes of speech data is sufficient to generate a speaker-adapted voice that sounds similar to the target speech [7]. In the following of this paper we refer the speaker-adapted voices as "voice clones".…”

Section: Speaker Adaptation (mentioning)

confidence: 99%

“…This approach can be seen as a first attempt of model-based voice reconstruction although it relies only on a partial modeling of the voice components. A voice building process using the hidden Markov model (HMM)-based speech synthesis technique has been investigated to create personalized VOCAs [7][8][9][10]. This approach has been shown to produce high quality output and offers two major advantages over existing methods for voice banking and voice building.…”

Section: Introduction (mentioning)

confidence: 99%

A Comparison of Manual and Automatic Voice Repair for Individual with Vocal Disabilities

Veaux

Yamagishi

King

2015

Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

Self Cite


When individuals lose the ability to produce their own speech, due to degenerative diseases such as motor neurone disease (MND) or Parkinson's, they lose not only a functional means of communication but also a display of their individual and group identity. In order to build personalized synthetic voices, attempts have been made to capture the voice before it is lost, using a process known as voice banking. But, for some patients, the speech deterioration frequently coincides with or quickly follows diagnosis. Using HMM-based speech synthesis, it is now possible to build personalized synthetic voices with minimal data recordings and even disordered speech. The power of this approach is that it is possible to use the patient's recordings to adapt existing voice models pre-trained on many speakers. When the speech has begun to deteriorate, the adapted voice model can be further modified in order to compensate for the disordered characteristics found in the patient's speech; we call this process "voice repair". In this paper we compare two methods of voice repair. The first method follows a trial and error approach and requires the expertise of a speech therapist. The second method is entirely automatic and based on some a priori statistical knowledge. A subjective evaluation shows that the automatic method achieves results similar to those of the manually controlled method.


“…While a state-of-the-art concatenative method [1,2] for TTS is capable of synthesizing natural and smooth speech for a specific voice, an SSS-based approach [3,4] has the strength to produce a diverse spectrum of voices without requiring significant amount of new data. This is an important feature for building next-generation applications such as a story-telling robot capable of synthesizing the speech of multiple characters with different emotions, personalized speech synthesis such as in speech-to-speech translation [5,6], and clinical applications such as voice reconstruction of patients with speech disorders [7]. In this article, we study the problem of generating new models of SSS from existing models.…”

Section: Introduction (mentioning)

confidence: 99%

Speaker-dependent model interpolation for statistical emotional speech synthesis

Hsu

Chen

2012

J. Audio Speech Music Proc.


In this article, we propose a speaker-dependent model interpolation method for statistical emotional speech synthesis. The basic idea is to combine the neutral model set of the target speaker and an emotional model set selected from a pool of speakers. For model selection and interpolation weight determination, we propose to use a novel monophone-based Mahalanobis distance, which is a proper distance measure between two Hidden Markov Model sets. We design Latin-square evaluation to reduce the systematic bias in the subjective listening tests. The proposed interpolation method achieves sound performance on the emotional expressiveness, the naturalness, and the target speaker similarity. Moreover, such performance is achieved without the need to collect the emotional speech of the target speaker, saving the cost of data collection and labeling.


