2022
DOI: 10.1609/aaai.v36i10.21350

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Abstract: Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score. Previous singing acoustic models adopt a simple loss (e.g., L1 and L2) or generative adversarial network (GAN) to reconstruct the acoustic features, while they suffer from over-smoothing and unstable training issues respectively, which hinder the naturalness of synthesized singing. In this work, we prop…

Cited by 111 publications (42 citation statements)
References: 26 publications
“…Diffusion models (Sohl-Dickstein et al, 2015; Ho et al, 2020; Cao et al, 2022) are a class of generative models that have achieved impressive results in image (Song et al, 2020; Lugmayr et al, 2022; Whang et al, 2022; Baranchuk et al, 2021; Wolleb et al, 2022), speech (Lee & Han, 2021; Chen et al, 2020; Kong et al, 2020; Liu et al, 2022a) and text (Li et al, 2022; Chen et al, 2022; Austin et al, 2021) synthesis. Recently, FoldingDiff (Wu et al, 2022a) shows that language models could be used for unconditional protein generation.…”
Section: Language Diffusion For Protein Structure Generation
Citation type: mentioning
confidence: 99%
“…To verify the singability in the end-to-end manner, we additionally use an open-source Chinese singing voice synthesis (SVS) model (Liu et al, 2022a) to supply the annotators with an actual audio rendition of the songs for more intuitive feeling. We randomly select 20 verses from the test set and show the music sheets and synthesized singing voice (see Appendix E) of each translated verse to five annotators.…”
Section: Evaluation Metrics
Citation type: mentioning
confidence: 99%
“…Denoising network θ in previous works can be mainly classified into two classes, Unet-based architecture (Ronneberger et al, 2015) for image-related tasks (Rombach et al, 2022; Voleti et al, 2022; Ho et al, 2020), and WaveNet-based architecture (van den Oord et al, 2016) for sequence-related tasks (Kong et al, 2020; Liu et al, 2022b; Kim et al, 2020). These networks consider the input as either grids or segments, lacking the ability to capture spatiotemporal correlations in STG data.…”
Section: Denoising Network: Ugnet
Citation type: mentioning
confidence: 99%