Multimodal approaches that jointly process audio and language are becoming increasingly important within music understanding and generation, giving rise to a new area of research, which we refer to as music-and-language (M&L). Several recent works have emerged in this domain, proposing methods to automatically generate music descriptions [21,7,9], synthesise music from a text prompt [2,14,30,6], search for music based on language queries [8,22,13], and more [20,17]. However, evaluating M&L models remains a challenge due to a lack of publicly accessible datasets with paired audio and language, resulting in the widespread use of private data [21,22,23,14,2,13] and inconsistent evaluation practices.