2022
DOI: 10.48550/arxiv.2205.11841
Preprint

SUSing: SU-net for Singing Voice Synthesis

Abstract: Singing voice synthesis is a generative task that requires multi-dimensional control of the singing model, including lyrics, pitch, and duration, as well as the singer's timbre and singing skills such as vibrato. In this paper, we propose SU-net for singing voice synthesis, named SUSing. Singing voice synthesis is treated as a translation task between the lyrics and music score and the spectrum. The lyrics and music score information is encoded into a two-dimensional feature representation through the convol…
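The abstract describes encoding lyrics and music-score information into a two-dimensional feature representation before it is translated into a spectrum. The exact layers are not given in this excerpt, so the sketch below is illustrative only: it shows one plausible way to stack one-hot phoneme and pitch sequences into a single time-by-feature map. All names, vocabulary sizes, and shapes here are assumptions, not the paper's actual architecture.

```python
import numpy as np

def encode_score(phoneme_ids, pitch_ids, n_phonemes=64, n_pitches=128):
    """One-hot embed phoneme and pitch sequences and concatenate them
    into a single 2D (time x feature) representation.

    Hypothetical sketch: the paper's real front end (and its use of
    convolution) is not reproduced in this excerpt."""
    T = len(phoneme_ids)
    phoneme_feat = np.zeros((T, n_phonemes))
    phoneme_feat[np.arange(T), phoneme_ids] = 1.0
    pitch_feat = np.zeros((T, n_pitches))
    pitch_feat[np.arange(T), pitch_ids] = 1.0
    # Concatenate along the feature axis: shape (T, n_phonemes + n_pitches)
    return np.concatenate([phoneme_feat, pitch_feat], axis=1)

# Four frames of aligned phoneme and MIDI-pitch indices
feat = encode_score([3, 7, 7, 12], [60, 62, 64, 60])  # shape (4, 192)
```

A 2D map like this can then be consumed by convolutional layers in the same way an image would be, which is presumably what motivates the translation-task framing.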

Cited by 1 publication (1 citation statement) · References 22 publications
“…The empirical setting of σ in the objective function of confusing information. First define the recognition error rate E_z of a training sentence z, where N = N_1 N_2 is the normalisation factor so that E_z ∈ [0, 1], N_1 is the number of correctly labelled words in the training sentence z, N_2 is the number of correctly recognised competing candidate words corresponding to the result, S_zR is the correct recognition result corresponding to the feature X_z(n), s_zi is the candidate competing word sequence, P_Λ(S_zR | X_z(n)) is the posterior probability that the feature X_z(n) in the training sentence z is correctly identified as S_zR under the acoustic model Λ, and F_BMPE(Λ) is the posterior probability that the feature X_z(n) in the training sentence z is identified as s_zi under the acoustic model Λ [26].…”
Section: Discriminative Training Criteria To Strengthen Confusing Inf… (mentioning)
confidence: 99%