2023
DOI: 10.48550/arxiv.2301.12661
Preprint
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Abstract: Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach that alleviates data sca…

Cited by 10 publications (11 citation statements)
References 32 publications
“…Generative Models Incorporating other generative models in Ubiq-Genie, beyond the employed image and text synthesis models, could lead to interesting applications. Potential models to be integrated could be capable of synthesising 3D models from text or images such as Point-E [16], personalised speech from text such as VALL-E [24], or audio from text or images such as Make-An-Audio [9] and MusicLM [1]. In addition, the currently implemented services could be expanded to build more advanced types of applications and experiences.…”
Section: Services and Applications
confidence: 99%
“…Acoustic synthesis and spatialization. Researchers have explored visually-guided sound synthesis [26,31,40] and text-guided audio synthesis [47,92,38]. Additionally, researchers have investigated generating realistic environmental acoustics using visual information [11,82,15,55].…”
Section: Related Work
confidence: 99%
“…In cases where audio information is the output, retrieval is applied in a music generation system with deep neural hashing that encodes the music segments (Royal et al, 2020). Audio-text retrieval is also applied to produce candidates in the process of pseudo prompt enhancement for text-to-audio generation (Huang et al, 2023a). Although there is a limited amount of research work which focuses on retrieval augmented generation tasks involving the audio, it could be a promising future direction (Li et al, 2022a).…”
Section: Audio
confidence: 99%
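The retrieval step mentioned in the last statement — selecting candidate audio-text pairs for pseudo prompt enhancement — can be illustrated as nearest-neighbor ranking in a shared embedding space. A minimal sketch, assuming precomputed embeddings; the function names and toy vectors below are hypothetical, not the paper's actual implementation:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(text_emb, audio_embs, k=2):
    # Rank candidate audio embeddings by similarity to the text
    # embedding and return the indices of the top-k candidates.
    ranked = sorted(range(len(audio_embs)),
                    key=lambda i: cosine_similarity(text_emb, audio_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional embeddings (hypothetical values).
text_emb = [1.0, 0.0, 0.0]
audio_embs = [
    [0.9, 0.1, 0.0],   # close match
    [0.0, 1.0, 0.0],   # unrelated
    [0.7, 0.0, 0.7],   # partial match
]
print(retrieve_top_k(text_emb, audio_embs, k=2))  # → [0, 2]
```

In practice the embeddings would come from a jointly trained audio-text encoder rather than toy vectors, and the retrieved candidates would seed the distill-then-reprogram prompt construction described in the abstract.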