In this study, we propose a speaker-dependent WaveNet vocoder, a method of synthesizing speech waveforms with WaveNet that uses the acoustic features of an existing vocoder as auxiliary features of WaveNet. WaveNet is expected to learn a sample-by-sample correspondence between the speech waveform and the acoustic features. The advantage of the proposed method is that it requires neither (1) explicit modeling of excitation signals nor (2) assumptions based on prior knowledge specific to speech. We conducted both subjective and objective evaluation experiments on the CMU ARCTIC database. The objective evaluation demonstrated that the proposed method generates high-quality speech while recovering the phase information that a mel-cepstrum vocoder discards. The subjective evaluation demonstrated that the sound quality of the proposed method is significantly better than that of the mel-cepstrum vocoder and that the proposed method captures source excitation information more accurately.
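To make the sample-by-sample correspondence concrete, the following Python sketch shows one way the frame-level features of a conventional vocoder can be upsampled to the waveform sampling rate before being fed to WaveNet as auxiliary conditioning; the frame shift, feature dimension, and sampling rate are hypothetical, and the WaveNet model itself is omitted.

import numpy as np

# Hypothetical setup: 5 ms frame shift at 16 kHz, i.e. 80 samples per frame.
frame_shift_samples = 80
num_frames, feat_dim = 200, 25                    # e.g. mel-cepstral coefficients
features = np.random.randn(num_frames, feat_dim)  # placeholder vocoder features

# Repeat each frame's feature vector so that every waveform sample t
# is paired with an auxiliary vector conditioning[t].
conditioning = np.repeat(features, frame_shift_samples, axis=0)

num_samples = num_frames * frame_shift_samples
assert conditioning.shape == (num_samples, feat_dim)
# WaveNet then generates x[t] autoregressively from past samples
# x[0..t-1] and conditioning[t].

Simple repetition is only one possible upsampling scheme; learned transposed convolutions or linear interpolation are common alternatives.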
This paper presents a statistical voice conversion (VC) technique with WaveNet-based waveform generation. VC based on a Gaussian mixture model (GMM) makes it possible to convert the speaker identity of a source speaker into that of a target speaker. However, in the conventional vocoding process, factors such as F0 extraction errors, parameterization errors, and over-smoothing of the converted feature trajectories introduce modeling errors in the speech waveform, which usually degrade the sound quality of the converted voice. To address this issue, we apply a direct waveform generation technique based on a WaveNet vocoder to VC. In the proposed method, the acoustic features of the source speaker are first converted into those of the target speaker with the GMM. The waveform samples of the converted voice are then generated by the WaveNet vocoder conditioned on the converted acoustic features. To investigate the modeling accuracy of the converted speech waveform, we compare several types of acoustic features for training and synthesis with the WaveNet vocoder. The experimental results confirm that the proposed VC technique achieves higher conversion accuracy for speaker individuality, with sound quality comparable to that of the conventional VC technique.
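As a minimal sketch of the first stage, the Python code below fits a joint GMM on time-aligned source/target feature frames and applies the standard minimum mean-square-error (MMSE) mapping E[y | x]; the alignment step, the feature extraction, and the exact conversion used in the paper (e.g., trajectory-based conversion) are omitted, and all names and dimensions here are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(src, tgt, n_components=8):
    # src, tgt: (num_frames, dim) arrays of time-aligned acoustic features.
    joint = np.hstack([src, tgt])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(joint)

def convert_frame(gmm, x, dim):
    # MMSE mapping E[y | x] under the joint GMM over z = [x; y].
    mu_x = gmm.means_[:, :dim]
    mu_y = gmm.means_[:, dim:]
    Sxx = gmm.covariances_[:, :dim, :dim]
    Syx = gmm.covariances_[:, dim:, :dim]
    K = len(gmm.weights_)
    post = np.empty(K)
    for k in range(K):                  # p(k | x) from the marginal GMM over x
        d = x - mu_x[k]
        inv = np.linalg.inv(Sxx[k])
        post[k] = gmm.weights_[k] * np.exp(-0.5 * d @ inv @ d) \
                  / np.sqrt(np.linalg.det(2.0 * np.pi * Sxx[k]))
    post /= post.sum()
    y = np.zeros(dim)
    for k in range(K):                  # posterior-weighted conditional means
        y += post[k] * (mu_y[k]
                        + Syx[k] @ np.linalg.inv(Sxx[k]) @ (x - mu_x[k]))
    return y

The converted frames produced this way would then be upsampled and used as the conditioning input of the WaveNet vocoder, as in the speaker-dependent case above.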
This paper describes an extension of separable lattice 2-D HMMs (SL-HMMs) using state duration models for image recognition. SL-HMMs are generative models that achieve size and location invariance through the state transitions of HMMs. However, the state duration probability of an HMM decreases exponentially with increasing duration, so it may not model image variations accurately. To overcome this problem, we employ the structure of hidden semi-Markov models (HSMMs), in which the state duration probability is modeled explicitly by parametric distributions. Face recognition experiments show that the proposed model improves performance for images with size and location variations.
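The contrast between the implicit and explicit duration models can be written out directly. For an HMM with self-transition probability $a_{ii}$, the probability of staying in state $i$ for exactly $d$ frames is geometric, whereas an HSMM replaces it with an explicit parametric distribution; the Gaussian form below is one common choice and is only an assumed example, not necessarily the distribution used in the paper:

\begin{align}
p_i(d) &= a_{ii}^{\,d-1}\,(1 - a_{ii}) && \text{(HMM: decays exponentially in } d\text{)} \\
p_i(d) &= \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left( -\frac{(d - \mu_i)^2}{2\sigma_i^2} \right) && \text{(HSMM: explicit duration model)}
\end{align}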