Interspeech 2017
DOI: 10.21437/interspeech.2017-1420
A Neural Parametric Singing Synthesizer

Abstract: We present a new model for singing synthesis based on a modified version of the WaveNet architecture. Instead of modeling raw waveform, we model features produced by a parametric vocoder that separates the influence of pitch and timbre. This allows conveniently modifying pitch to match any target melody, facilitates training on more modest dataset sizes, and significantly reduces training and generation times. Our model makes frame-wise predictions using mixture density outputs rather than categorical outputs …
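The abstract's frame-wise mixture density outputs can be sketched as follows: instead of a softmax over quantized sample values, the network head emits the weights, means, and variances of a Gaussian mixture over each continuous vocoder feature, and generation samples from that mixture. This is a minimal toy illustration of the general technique; all names, shapes, and weights are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_density_params(h, W_pi, W_mu, W_sigma):
    """Map a hidden vector h to the parameters of a K-component
    Gaussian mixture over one continuous vocoder feature."""
    logits = h @ W_pi
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                      # mixture weights via softmax
    mu = h @ W_mu                       # component means
    sigma = np.exp(h @ W_sigma)         # component std devs, kept positive
    return pi, mu, sigma

def sample(pi, mu, sigma):
    """Draw one feature value: pick a component, then sample from it."""
    k = rng.choice(len(pi), p=pi)
    return rng.normal(mu[k], sigma[k])

# Toy dimensions: hidden size 8, K = 3 mixture components.
H, K = 8, 3
h = rng.standard_normal(H)
W_pi, W_mu, W_sigma = (rng.standard_normal((H, K)) * 0.1 for _ in range(3))
pi, mu, sigma = mixture_density_params(h, W_pi, W_mu, W_sigma)
x = sample(pi, mu, sigma)
```

Predicting continuous mixture parameters rather than a categorical distribution is what lets the model operate on real-valued vocoder features at frame rate instead of quantized raw audio at sample rate.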

Cited by 38 publications (44 citation statements); references 10 publications.
“…This model's ability to accurately generate raw speech waveform sample-by-sample clearly shows that oversmoothing is not an issue. Recently, we presented a model for singing synthesis based on the WaveNet model [6], with an important difference being that we model vocoder features rather than raw waveform. While a vocoder unavoidably introduces some degradation in sound quality, we consider the degradation introduced by current models to still be the dominant factor.…”
Section: Introduction
confidence: 99%
“…In recent years, several kinds of DNN-based singing voice synthesis systems [4,17,18,19,20] have been proposed. In the training part of the basic system [4], parameters for spectrum (e.g., mel-cepstral coefficients), excitation, and vibrato are extracted from a singing voice database as acoustic features.…”
Section: DNN-based Singing Voice Synthesis
confidence: 99%
“…This allows us to model temporal dependencies between features within that block. This temporal dependence is modelled via autoregression in the Neural Parametric Singing Synthesizer (NPSS) [2] model, which we use as a baseline in our study.…”
Section: Related Work
confidence: 99%
“…This is ideal for the singing voice, as the pitch range of the voice while singing is much higher than that while speaking normally. Modelling the timbre independently of the pitch has been shown to be an effective methodology [2]. We note that the use of a vocoder for direct synthesis can lead to a degradation of sound quality, but this degradation can be mitigated by the use of a WaveNet vocoder trained to synthesize the waveform from the parametric vocoder features.…”
Section: Introduction
confidence: 99%