2021
DOI: 10.1109/taslp.2021.3104165

Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System

Abstract: This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system. In recent years, DNNs have been utilized in statistical parametric SVS systems, and DNN-based SVS systems have demonstrated better performance than conventional hidden Markov model-based ones. SVS systems are required to synthesize a singing voice with pitch and timing that strictly follow a given musical score. Additionally, singing expressions that are not described on the musical score, such as vibrato and tim…

Cited by 25 publications (7 citation statements)
References: 31 publications
“…In [95], a non-autoregressive neural vocoder called Period-Net [107] is adopted, which is a non-autoregressive GAN-based neural vocoder that is shown to be more robust for generating accurate pitch. Moreover, an automatic pitch correction technique is incorporated that ensures accurate pitch in the synthesized singing voices.…”
Section: Multi-variate Density Output
Citation type: mentioning
confidence: 99%
“…Another unique characteristic of singing voices is that F0 includes periodic fluctuations due to vibrato. In [75], [95], the vibrato was separated from the original F0 sequence in advance and modeled with sinusoidal parameters. The advantage of this approach is that it provides direct control over the vibrato intensity and frequency in the synthesis stage.…”
Section: Multi-variate Density Output
Citation type: mentioning
confidence: 99%
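
The idea in the statement above, describing vibrato with sinusoidal parameters so that its intensity and frequency can be set directly at synthesis time, can be illustrated with a minimal sketch. This is not the method of the cited papers: the function name `apply_vibrato`, the use of a single fixed rate and depth, the depth unit (cents), and the 5 ms frame shift are all assumptions made for illustration only.

```python
import math


def apply_vibrato(base_f0_hz, frame_shift_s, rate_hz, depth_cents):
    """Add a sinusoidal vibrato to a frame-level F0 contour (hypothetical helper).

    base_f0_hz   : list of F0 values in Hz, one per frame; values <= 0 mark unvoiced frames
    frame_shift_s: time between consecutive frames, in seconds
    rate_hz      : vibrato frequency in Hz (how fast the pitch oscillates)
    depth_cents  : vibrato intensity, peak deviation in cents (100 cents = 1 semitone)
    """
    out = []
    for i, f0 in enumerate(base_f0_hz):
        if f0 <= 0.0:
            # Leave unvoiced frames untouched.
            out.append(f0)
            continue
        t = i * frame_shift_s
        # Deviation in cents follows a sinusoid; convert cents to a frequency ratio.
        cents = depth_cents * math.sin(2.0 * math.pi * rate_hz * t)
        out.append(f0 * 2.0 ** (cents / 1200.0))
    return out


# A flat 440 Hz note lasting 1 s at an assumed 5 ms frame shift.
flat = [440.0] * 200
sung = apply_vibrato(flat, 0.005, rate_hz=5.5, depth_cents=50.0)
```

Because the vibrato is kept separate from the underlying note-level F0, changing `rate_hz` or `depth_cents` alters only the oscillation, which is the kind of direct control the statement refers to.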
“…A method of explicitly modeling information such as pitch curves, energy, V/UV, which can be extracted directly from the vocal signal, was proposed in [2]. [3] proposed a method to interpret the music score more naturally by introducing a module that predicts the difference between the actual singing and the score. Efforts to create natural pitch contour have also been made in various ways, such as directly predicting f0 from note sequences [11,10,13], or predicting variables of the parametric f0 contours [19].…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…Singing voice synthesis (SVS) is the task of generating a natural singing voice from a given musical score. With the development of various deep generative models, research on synthesizing high-quality singing voice has been emerging recently [1,2,3,4]. As the performance of the SVS improves, there are increasing cases in which the technology is applied to the production of actual music content [5].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…To further distinguish vowels and consonants, a duration predictor is built to produce fine-grained phoneme-level duration, which is trained based on supervision calculated by force-alignment [6][7][8][9][10][11], heuristics [12][13][14][15] etc. The advantage of this type of feature processing strategy is that the input phoneme and pitch sequence are strictly aligned at the note level based on the music score.…”
Section: Introduction
Citation type: mentioning
confidence: 99%