Yusuke Ijima scite author profile

SUMMARY: Deep neural network (DNN)-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, it remains unclear whether synthesized speech quality can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a method to improve the performance of the conventional speaker-dependent DNN-based method. To model speaker variation in the DNN, an augmented feature (the speaker code) is fed to the hidden layer(s) of the conventional DNN. This paper investigates the effectiveness of introducing speaker codes to DNN acoustic models for speech synthesis in two tasks: multi-speaker modeling and speaker adaptation. For the multi-speaker modeling task, the proposed method trains the connection weights of the whole DNN using a multi-speaker speech corpus. When performing multi-speaker synthesis, the speaker code corresponding to the selected target speaker is fed to the DNN to generate that speaker's voice. When performing speaker adaptation, a set of connection weights of the multi-speaker model is re-estimated to generate a new target speaker's voice. We investigated the relationship between the prediction performance and the architecture of the DNNs through objective measurements. Objective evaluation experiments revealed that the proposed model outperformed conventional methods (HMMs, speaker-dependent DNNs, and multi-speaker DNNs based on a shared hidden layer structure). Subjective evaluation results showed that the proposed model again outperformed the conventional methods (HMMs and speaker-dependent DNNs), especially when only a small number of target speaker utterances was available.

key words: speech synthesis, acoustic model, deep neural network, speaker codes
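The idea of feeding a speaker code to the hidden layers can be illustrated with a minimal sketch. This is not the authors' implementation: the one-hot encoding of the speaker code, the concatenation of the code to each hidden layer's input, the tanh activation, the layer sizes, the random weights, and all function names (`one_hot`, `make_layer`, `affine`, `forward`, `build_network`) are assumptions made purely for illustration.

```python
import math
import random


def one_hot(index, size):
    """Return a one-hot speaker code (an assumed encoding)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def make_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def forward(layers, linguistic, speaker_code):
    """Forward pass in which every hidden layer also receives the speaker code."""
    *hidden_layers, output_layer = layers
    h = linguistic
    for layer in hidden_layers:
        # Assumption: the code is concatenated to the layer input before the affine map.
        h = [math.tanh(a) for a in affine(layer, h + speaker_code)]
    return affine(output_layer, h)  # linear output standing in for acoustic features


def build_network(n_ling, n_hidden, n_layers, n_speakers, n_out, seed=0):
    """Build hidden layers whose inputs are widened by the speaker code size."""
    rng = random.Random(seed)
    layers = []
    n_in = n_ling
    for _ in range(n_layers):
        layers.append(make_layer(n_in + n_speakers, n_hidden, rng))
        n_in = n_hidden
    layers.append(make_layer(n_hidden, n_out, rng))
    return layers


# Hypothetical sizes: 5 linguistic features, 3 speakers, 4 acoustic outputs.
net = build_network(n_ling=5, n_hidden=8, n_layers=2, n_speakers=3, n_out=4)
linguistic = [0.2, -0.1, 0.5, 0.0, 1.0]
out_a = forward(net, linguistic, one_hot(0, 3))
out_b = forward(net, linguistic, one_hot(1, 3))
```

With the same linguistic input, switching the speaker code changes the output while all other weights stay shared, which is the sense in which a single network could serve several speakers. Under the same assumptions, speaker adaptation would correspond to re-estimating only a subset of these weights for a new speaker.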


Soft-Target Training with Ambiguous Emotional Utterances for DNN-Based Speech Emotion Classification

Ando, Kobashikawa, Kamiyama, et al. 2018


Neural Confnet Classification: Fully Neural Network Based Spoken Utterance Classification Using Word Confusion Networks

Masumura, Ijima, Asami, et al. 2018




Generative adversarial network-based postfilter for statistical parametric speech synthesis

Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors

DNN-Based Speech Synthesis Using Speaker Codes

