2021
DOI: 10.48550/arxiv.2106.07431
Preprint

CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis

Simon Rouard,
Gaëtan Hadjeres

Abstract: In this paper, we propose a novel score-based generative model for unconditional raw audio synthesis. Our proposal builds upon the latest developments on diffusion process modeling with stochastic differential equations, which already demonstrated promising results on image generation. We motivate novel heuristics for the choice of the diffusion processes better suited for audio generation, and consider the use of a conditional U-Net to approximate the score function. While previous approaches on diffusion mode…

Cited by 2 publications (6 citation statements)
References 7 publications
“…Pioneering work WaveGrad [7] Code / Project DiffWave [45] Code Efficient vocoder BDDM [48] Code InferGrad [9] WaveFit [43] Project Statistical improvement DDGM [70] PriorGrad [50] Project ItôWave [125] Project SpecGrad [44] End-to-end Pioneering work WaveGrad 2 [8] Code / Project CRASH [90] Project Efficient model FastDiff [26] Code / Project Further improvements DAG [79] Itôn [99] Project statistical parametric speech synthesis (SPSS) was a popular method [115,116,132,133,137] consisting of three stages. As shown in Figure 1 (a), the text input is first converted to linguistic features, then acoustic features, and to the waveform in the last stage.…”
Section: Overview Of the Text-to-speech Development (mentioning)
confidence: 99%
“…Experimental results show that WaveGrad 2 [8] can generate high-quality audio in an end-to-end manner compared to strong baselines. Controllable Raw audio synthesis with High-resolution (CRASH) [90] is a concurrent work to WaveGrad 2 [8] that proposes an end-to-end model for drum sound synthesis. Based on SDE, CRASH [90] applies a noise-conditioned U-Net to estimate the score function, and introduces a class-mixing sampling to generate 'hybrid' sounds.…”
Section: End-to-end Framework (mentioning)
confidence: 99%
“…Inspired by the recent successes of diffusion models Sohl-Dickstein et al (2015); Ho et al (2020); Kingma et al (2021) in solving audio tasks Rouard and Hadjeres (2021); Kong et al (2020), we chose to employ them for the task of synthesizing mel spectrograms. Diffusion models can be thought of as a Markovian Hierarchical Variational Autoencoder Luo (2022).…”
Section: Diffusion (mentioning)
confidence: 99%
“…In recent years, the field of generative modeling has seen significant growth with various techniques, including generative adversarial networks (GANs) Goodfellow et al (2020), variational autoencoders (VAEs) Kingma and Welling (2013), normalizing flows Rezende and Mohamed (2015), autoregressive models Dhariwal et al (2020), and diffusion models Sohl-Dickstein et al (2015); Ho et al (2020); Kingma et al (2021), driving progress in various fields. These techniques have achieved human-level performance in tasks such as image generation Rombach et al (2022); Karras et al (2020); Dhariwal and Nichol (2021); Saharia et al (2022); Ramesh et al (2022), speech generation Kong et al (2020); Shen et al (2017), and text generation Brown et al (2020); Scao et al (2022), as well as progressed music generation Dhariwal et al (2020); Rouard and Hadjeres (2021); Engel et al (2019); Marafioti et al (2019) and other areas.…”
Section: Introduction (mentioning)
confidence: 99%