Speaker-Dependent WaveNet Vocoder

Tamamori, Akira; Hayashi, Tomoki; Kobayashi, Kenzo; Takeda, Kazuya; Toda, Tomoki

doi:10.21437/interspeech.2017-314

Cited by 281 publications

(234 citation statements)

References 12 publications

Mentioning: 232

“…Nevertheless, this information is crucial for the inversion from the frequency domain back into a temporal signal. Recent studies show that high quality speech waveforms can be synthesized by using Wavenet [46] conditioned on acoustic features estimated from a mel-cepstrum vocoder [47]. During network training, the model learns the link between speech signal and its acoustic features automatically without making any assumptions about prior knowledge of speech.…”

Section: Wavenet Vocoder For the Reconstruction Of Audible Waveforms (mentioning)

confidence: 99%

Speech Synthesis from ECoG using Densely Connected 3D Convolutional Neural Networks

Angrick, Mugler, Tate, et al. 2018

Preprint


Objective. Direct synthesis of speech from neural signals could provide a fast and natural way of communication to people with neurological diseases. Invasively-measured brain activity (electrocorticography; ECoG) supplies the necessary temporal and spatial resolution to decode fast and complex processes such as speech production. A number of impressive advances in speech decoding using neural signals have been achieved in recent years, but the complex dynamics are still not fully understood. However, it is unlikely that simple linear models can capture the relation between neural activity and continuous spoken speech. Approach. Here we show that deep neural networks can be used to map ECoG from speech production areas onto an intermediate representation of speech (logMel spectrogram). The proposed method uses a densely connected convolutional neural network topology which is well-suited to work with the small amount of data available from each participant. Main results. In a study with six participants, we achieved correlations up to r = 0.69 between the reconstructed and original logMel spectrograms. We transferred our prediction back into an audible waveform by applying a Wavenet vocoder. The vocoder was conditioned on logMel features that harnessed a much larger, pre-existing data corpus to provide the most natural acoustic output. Significance. To the best of our knowledge, this is the first time that high-quality speech has been reconstructed from neural recordings during speech production using deep neural networks.


“…The NU VC system uses a WaveNet-based vocoder [17,18,19] to model the waveform of the target speaker and generate the converted waveform using estimated speech features. Several flows are used in producing the estimated spectral features, where the direct waveform modification [2] method is employed.…”

Section: Waveform-processing Module (mentioning)

confidence: 99%

“…On the other hand, in the handling of prosodic parameters, such as fundamental frequency (F0), several methods have been commonly used including a simple mean/variance linear transformation, a contour-based transformation [13], GMM-based mapping [14], and neural network [15]. For waveform generation, approaches include the source-filter vocoder system [16], the latest direct waveform modification technique [2], and the use of state-of-the-art WaveNet modeling [17,18,19].…”

Section: Introduction (mentioning)

confidence: 99%


NU Voice Conversion System for the Voice Conversion Challenge 2018

Tobing, Wu, Hayashi, et al. 2018

EasyChair Preprints

Self Cite


This paper presents the NU (Nagoya University) voice conversion (VC) system for the HUB task of the Voice Conversion Challenge 2018 (VCC 2018). The design of the NU VC system can basically be separated into two modules consisting of a speech parameter conversion module and a waveform-processing module. In the speech parameter conversion module, a deep learning framework is deployed to estimate the spectral parameters of a target speaker given those of a source speaker. Specifically, a deep neural network (DNN) and a deep mixture density network (DMDN) are used as the deep model structure. In the waveform-processing module, given the estimated spectral parameters and linearly transformed F0 parameters, the converted waveform is generated using a WaveNet-based vocoder system. To use the WaveNet-based vocoder, there are several generation flows based on an analysis-synthesis framework to obtain the speech parameter set, on the basis of which a system selection process is performed to select the best one in an utterance-wise manner. The results of VCC 2018 ranked the NU VC system in second place with an overall mean opinion score (MOS) of 3.44 for speech quality and 85% accuracy for speaker similarity.


“…More recently, deep learning techniques have reshaped the way speech synthesis is done. Many neural waveform synthesizers surpass traditional parametric synthesis models in speech quality [6,7]. These waveform synthesizers avoid many speech-specific assumptions by using generic neural networks, e.g., the convolution net in WaveNet [6] and recurrent net in SampleRNN [8].…”

Section: Introduction (mentioning)

confidence: 99%

Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation

Zhao, Wang, Juvela, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation. The similarity between speech and music audio synthesis techniques suggests interesting avenues to explore in terms of the best way to apply speech synthesizers in the music domain. This work compares three neural synthesizers used for musical instrument sounds generation under three scenarios: training from scratch on music data, zero-shot learning from the speech domain, and fine-tuning-based adaptation from the speech to the music domain. The results of a large-scale perceptual test demonstrated that the performance of three synthesizers improved when they were pre-trained on speech data and fine-tuned on music data, which indicates the usefulness of knowledge from speech data for music audio generation. Among the synthesizers, WaveGlow showed the best potential in zero-shot learning while NSF performed best in the other scenarios and could generate samples that were perceptually close to natural audio.
