MusCaps: Generating Captions for Music Audio

Manco, Ilaria; Benetos, Emmanouil; Quinton, Elio; Fazekas, György

doi:10.1109/ijcnn52387.2021.9533461

Cited by 15 publications

(24 citation statements)

References 35 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…[160] is proposed to generate descriptions for music playlists by combining audio content analysis and natural language processing to utilize the information of each track. MusCaps [161] is a music audio captioning model that generates descriptions of music audio content by processing audio-text inputs through a multimodal encoder and leveraging audio data pre-training to obtain effective musical feature representations. For music and language pre-training, Manco et al [162] propose a multimodal architecture, which uses weakly aligned text as the only supervisory signal to learn general-purpose music audio representations.…”

Section: Text Audio Generationmentioning

confidence: 99%

“…Text-Audio Generation https://github.com/rishikksh20/AdaSpeech2 2020 Lombard [154] Text-Audio Generation https://github.com/dipjyoti92/TTS-Style-Transfer 2019 Zhang et al [156] Text-Audio Generation https://github.com/PaddlePaddle/PaddleSpeech 2019 Yu et al [157] Text-Music Generation -2018 JTAV [158] Text-Music Generation https://github.com/mengshor/JTAV 2021 Ferraro et al [159] Text-Music Generation https://github.com/andrebola/contrastive-mir-learning 2016 Choi et al [160] Text-Music Generation -2021 MusCaps [161] Text-Music Generation https://github.com/ilaria-manco/muscaps 2022 Manco et al [162] Text-Music Generation https://github.com/ilaria-manco/mulap 2022 CLAP [163] Text-Music Generation https://github.com/YuanGongND/vocalsound 2020 Jukebox [203] Text-Music Generation https://github.com/openai/jukebox…”

Section: A Curated Advances In Generative Aimentioning

confidence: 99%

See 1 more Smart Citation

A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT

Cao¹,

Li²,

Liu³

et al. 2023

Preprint

View full text Add to dashboard Cite

Recently, ChatGPT, along with DALL-E-2 [1] and Codex [2],has been gaining significant attention from society. As a result, many individuals have become interested in related resources and are seeking to uncover the background and secrets behind its impressive performance. In fact, ChatGPT and other Generative AI (GAI) techniques belong to the category of Artificial Intelligence Generated Content (AIGC), which involves the creation of digital content, such as images, music, and natural language, through AI models. The goal of AIGC is to make the content creation process more efficient and accessible, allowing for the production of high-quality content at a faster pace. AIGC is achieved by extracting and understanding intent information from instructions provided by human, and generating the content according to its knowledge and the intent information. In recent years, large-scale models have become increasingly important in AIGC as they provide better intent extraction and thus, improved generation results. With the growth of data and the size of the models, the distribution that the model can learn becomes more comprehensive and closer to reality, leading to more realistic and high-quality content generation. This survey provides a comprehensive review on the history of generative models, and basic components, recent advances in AIGC from unimodal interaction and multimodal interaction. From the perspective of unimodality, we introduce the generation tasks and relative models of text and image. From the perspective of multimodality, we introduce the cross-application between the modalities mentioned above. Finally, we discuss the existing open problems and future challenges in AIGC.

show abstract

Section: Text Audio Generationmentioning

confidence: 99%

Section: A Curated Advances In Generative Aimentioning

confidence: 99%

A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT

Cao¹,

Li²,

Liu³

et al. 2023

Preprint

View full text Add to dashboard Cite

show abstract

“…Core research in MIR, however, still focuses on tasks such as key and chord recognition [100], [101], tempo and beat tracking [102], the detection of musical note onsets [103], [104], automatic music transcription [105], classification [106], and description (also known as captioning) [107], [108] as well as music emotion recognition [109]- [111]. A large body of research considers musical audio in these tasks to support search, retrieval and interaction use cases.…”

Section: Software: -Acoustic Features Extraction Algorithms -Detectio...mentioning

confidence: 99%

“…Generating full sentence descriptions of a music piece may be considered an extension of the tagging problem. This involves the use of an acoustic model and a large language model [108].…”

Section: Software: -Acoustic Features Extraction Algorithms -Detectio...mentioning

confidence: 99%

The Internet of Sounds: Convergent Trends, Insights, and Future Directions

Turchet

Lagrange

Rottondi

et al. 2023

IEEE Internet Things J.

View full text Add to dashboard Cite

Current sound-based practices and systems developed in both academia and industry point to convergent research trends that bring together the field of Sound and Music Computing with that of the Internet of Things. This paper proposes a vision for the emerging field of the Internet of Sounds (IoS), which stems from such disciplines. The IoS relates to the network of Sound Things, i.e., devices capable of sensing, acquiring, processing, actuating, and exchanging data serving the purpose of communicating sound-related information. In the IoS paradigm, which merges under a unique umbrella the emerging fields of the Internet of Musical Things and the Internet of Audio Things, heterogeneous devices dedicated to musical and nonmusical tasks can interact and cooperate with one another and with other things connected to the Internet to facilitate soundbased services and applications that are globally available to the users. We survey the state of the art in this space, discuss the technological and non-technological challenges ahead of us and propose a comprehensive research agenda for the field.

show abstract

“…To address the above limitations, Choi et al (2016) introduce the task of playlist captioning, that is automatically describing a playlist using natural language. Playlist captioning can enable several useful applications, such as assisting curators in the process of finding an appropriate caption for a playlist; enabling search and discovery of playlists through human-like queries (Manco et al, 2021); assigning captions to algorithm-generated playlists, that could be also used as explanations for automatic playlist recommendations (Afchar et al, 2022). Even so, playlist captioning is still an under-researched topic.…”

Section: Introductionmentioning

confidence: 99%

Data-Efficient Playlist Captioning With Musical and Linguistic Knowledge

Gabbolini¹,

Hennequin²,

Epure³

2022

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

View full text Add to dashboard Cite

Music streaming services feature billions of playlists created by users, professional editors or algorithms. In this content overload scenario, it is crucial to characterise playlists, so that music can be effectively organised and accessed. Playlist titles and descriptions are proposed in natural language either manually by music editors and users or automatically from pre-defined templates. However, the former is time-consuming while the latter is limited by the vocabulary and covered music themes. In this work, we propose PLAYNTELL, a dataefficient multi-modal encoder-decoder model for automatic playlist captioning. Compared to existing music captioning algorithms, PLAYN-TELL leverages also linguistic and musical knowledge to generate correct and thematic captions. We benchmark PLAYNTELL on a new editorial playlists dataset collected from two major music streaming services. PLAYN-TELL yields 2x-3x higher BLEU@4 and CIDEr than state of the art captioning algorithms.

show abstract

MusCaps: Generating Captions for Music Audio

Cited by 15 publications

References 35 publications

A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT

A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT

The Internet of Sounds: Convergent Trends, Insights, and Future Directions

Data-Efficient Playlist Captioning With Musical and Linguistic Knowledge

Contact Info

Product

Resources

About