2023
DOI: 10.48550/arxiv.2301.12661
Preprint
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Abstract: Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach that alleviates data sca…

Cited by 10 publications (11 citation statements)
References 32 publications
“…Generative Models Incorporating other generative models in Ubiq-Genie, beyond the employed image and text synthesis models, could lead to interesting applications. Potential models to be integrated could be capable of synthesising 3D models from text or images such as Point-E [16], personalised speech from text such as VALL-E [24], or audio from text or images such as Make-An-Audio [9] and MusicLM [1]. In addition, the currently implemented services could be expanded to build more advanced types of applications and experiences.…”
Section: Services and Applications
confidence: 99%
“…Acoustic synthesis and spatialization. Researchers have explored visually-guided sound synthesis [26,31,40] and text-guided audio synthesis [47,92,38]. Additionally, researchers have investigated generating realistic environmental acoustics using visual information [11,82,15,55].…”
Section: Related Work
confidence: 99%
“…In cases where audio information is the output, retrieval is applied in a music generation system with deep neural hashing that encodes the music segments (Royal et al, 2020). Audio-text retrieval is also applied to produce candidates in the process of pseudo prompt enhancement for text-to-audio generation (Huang et al, 2023a). Although there is a limited amount of research work which focuses on retrieval augmented generation tasks involving the audio, it could be a promising future direction (Li et al, 2022a).…”
Section: Audio
confidence: 99%
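The retrieval step mentioned in the last statement — selecting candidate audio-text pairs for pseudo prompt enhancement — can be illustrated as nearest-neighbor ranking in a shared embedding space. A minimal sketch, assuming precomputed embeddings; the function names and toy vectors below are hypothetical, not the paper's actual implementation:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(text_emb, audio_embs, k=2):
    # Rank candidate audio embeddings by similarity to the text
    # embedding and return the indices of the top-k candidates.
    ranked = sorted(range(len(audio_embs)),
                    key=lambda i: cosine_similarity(text_emb, audio_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional embeddings (hypothetical values).
text_emb = [1.0, 0.0, 0.0]
audio_embs = [
    [0.9, 0.1, 0.0],   # close match
    [0.0, 1.0, 0.0],   # unrelated
    [0.7, 0.0, 0.7],   # partial match
]
print(retrieve_top_k(text_emb, audio_embs, k=2))  # → [0, 2]
```

In practice the embeddings would come from a jointly trained audio-text encoder rather than toy vectors, and the retrieved candidates would seed the distill-then-reprogram prompt construction described in the abstract.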