2023
DOI: 10.48550/arxiv.2302.03917
Preprint

Noise2Music: Text-conditioned Music Generation with Diffusion Models

Abstract: We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation,…
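
The abstract describes a two-stage cascade: a generator diffusion model maps text to an intermediate representation, and a cascader diffusion model maps that intermediate (plus, possibly, the text) to high-fidelity audio. The sketch below illustrates that control flow only and is not the paper's implementation: the denoisers are stubs standing in for large trained networks, and the sampling rates, noise schedule, and repeat-upsampling of the conditioning signal are all assumptions.

```python
# Minimal sketch of the generator -> cascader pipeline described in the
# abstract. NOT the paper's implementation: the denoisers are stubs, and the
# sampling rates, conditioning upsample, and noise schedule are assumptions.
import numpy as np

def ddpm_sample(denoise_fn, shape, cond, n_steps=100, seed=0):
    """Generic DDPM ancestral sampling loop (Ho et al., 2020)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)       # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                 # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoise_fn(x, t, cond)               # model's noise prediction
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                  # no noise on the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

def generator_eps(x, t, text_emb):
    """Stub for the generator model: predicts noise on the intermediate
    representation (here, a low-fidelity waveform) conditioned on text."""
    return np.zeros_like(x)  # a trained network would return a real estimate

def cascader_eps(x, t, cond):
    """Stub for the cascader model: predicts noise on high-rate audio,
    conditioned on the upsampled intermediate (and possibly the text)."""
    return np.zeros_like(x)

text_emb = np.zeros(512)                           # hypothetical text embedding
# Stage 1: text -> intermediate representation (assumed 30 s at 3.2 kHz).
intermediate = ddpm_sample(generator_eps, (30 * 3200,), text_emb)
# Stage 2: intermediate (+ text) -> higher-fidelity audio (assumed 16 kHz).
upsampled = np.repeat(intermediate, 16000 // 3200) # naive conditioning upsample
audio = ddpm_sample(cascader_eps, (30 * 16000,), (upsampled, text_emb))
```

Running the two sampling loops in succession is the whole cascade: the second model only ever sees the first model's output as conditioning, which is what lets each stage handle a tractable slice of the generation problem.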

Cited by 9 publications (15 citation statements)
References 20 publications
“…For these reasons, there is a growing interest in using alternative generative approaches, such as flow-based models [6] or, as in the present study, diffusion models [5], [7], [12], [30]. Other works also used diffusion-based audio super-resolution models within the context of text-to-audio generation, where their purpose was to separate the task of high-resolution audio generation into separate hierarchical steps [31], [32].…”
Section: A. Audio Bandwidth Extension and Super-Resolution
confidence: 99%
“…As a consequence of their high sampling rates, audio signals, when seen as vectors, are high-dimensional, a property that makes the training of a diffusion model difficult. Recent successful diffusion models in audio circumvent this issue by designing the diffusion process in a compressed latent space [32], [57] or by subdividing the task into a sequence of independent cascaded models [31]. However, utilizing reconstruction guidance without any further modifications requires designing a single-stage diffusion process in the raw audio domain, because relying on a decoder or a super-resolution model could potentially harm the quality of the gradients.…”
Section: Implementation Details
confidence: 99%
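
The passage above argues that reconstruction guidance only works cleanly when the diffusion runs in a single stage on raw audio, so that the guidance gradient never passes through a decoder or super-resolution network. The following numpy sketch illustrates that idea; it assumes decimation as the (linear) degradation operator, a stubbed denoiser, a fixed guidance weight, and the common simplification of treating the noise prediction as locally constant when differentiating. It is illustrative, not the cited papers' exact algorithm.

```python
# Sketch of reconstruction guidance in a single-stage raw-audio sampler.
# Assumptions: decimation as the degradation A, a stubbed denoiser, and a
# fixed guidance weight (real systems tune or schedule it).
import numpy as np

FACTOR = 4                                  # assumed decimation factor

def A(x):
    """Degradation operator: keep every FACTOR-th sample."""
    return x[::FACTOR]

def A_T(y, n):
    """Adjoint of A: zero-insertion back to length n."""
    x = np.zeros(n)
    x[::FACTOR] = y
    return x

def guided_sample(denoise_fn, y_obs, n, n_steps=100, weight=1.0, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)
    alphas, alpha_bars = 1.0 - betas, np.cumprod(1.0 - betas)
    x = rng.standard_normal(n)
    for t in reversed(range(n_steps)):
        eps = denoise_fn(x, t)
        # Tweedie-style estimate of the clean signal from the current state.
        x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        # Gradient of 0.5 * ||y - A(x0_hat)||^2 w.r.t. x, with eps treated as
        # constant. Because A acts directly on raw audio, no decoder or
        # super-resolution network sits in this gradient path -- the point
        # the quoted passage makes.
        grad = A_T(A(x0_hat) - y_obs, n) / np.sqrt(alpha_bars[t])
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = mean - weight * grad
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(n)
    return x

denoise_stub = lambda x, t: np.zeros_like(x)   # placeholder for a trained model
y_obs = np.zeros(4000)                         # hypothetical low-rate observation
restored = guided_sample(denoise_stub, y_obs, n=16000)
```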
“…The Textune dataset's richness further bolstered the refinement of their transformer-centric methodology for deriving music from text. Drawing a parallel, Huang et al. [245] demonstrated that utilizing LLMs to craft descriptive musical sentences can enhance the synthesis of text-conditioned music when combined with a diffusion model. Donahue et al. [246] introduced SingSong, a novel method for generating instrumental music tailored to complement specific vocal inputs.…”
Section: Large Audio Models in Music
confidence: 99%
“…8: A prompt-completion example for LaunchPadGPT [262]: the text following "prompt:" represents MFCC feature values, while "completion:" shows RGB-X tuples. The tuple (245, 5, 169, 1) indicates that the Launchpad keyboard's second button (index 0 for the first) is purple. Figure taken from [262].…”
Section: Large Audio Models in Music
confidence: 99%
“…Multimodal approaches that jointly process audio and language are becoming increasingly important within music understanding and generation, giving rise to a new area of research, which we refer to as music-and-language (M&L). Several recent works have emerged in this domain, proposing methods to automatically generate music descriptions [21,7,9], synthesise music from a text prompt [2,14,30,6], search for music based on language queries [8,22,13], and more [20,17]. However, evaluating M&L models remains a challenge due to a lack of public and accessible datasets with paired audio and language, resulting in the widespread use of private data [21,22,23,14,2,13] and inconsistent evaluation practices.…”
Section: Introduction
confidence: 99%