A Transformer-Based Audio Captioning Model with Keyword Estimation

Masumura, Ryo; Nishida, K.; Yasuda, Masahiro; Saito, Shoichiro

doi:10.21437/interspeech.2020-2087

Cited by 47 publications

(46 citation statements)

References 27 publications


“…Most of these works also make use of attention mechanisms to align the audio and text modalities [14], [15], [18], [21]. More recently, following the success of self-attention in V&L models, a small body of work has also started exploring the use of Transformer-based models in audio captioning [19], [23].…”

Section: B Audio and Language (mentioning)

confidence: 99%

MusCaps: Generating Captions for Music Audio

Manco, Benetos, Quinton et al. 2021

Preprint


Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention. Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs through a multimodal encoder and leverages pretraining on audio data to obtain representations that effectively capture and summarise musical features in the input. Evaluation of the generated captions through automatic metrics shows that our method outperforms a baseline designed for non-music audio captioning. Through an ablation study, we unveil that this performance boost can be mainly attributed to pre-training of the audio encoder, while other design choices (modality fusion, decoding strategy and the use of attention) contribute only marginally. Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.


“…CNN-RNN [10] and CNN-Transformer [11] are the two dominant architectures which achieve state-of-the-art performance, while Transformer-only network also shows competitive performance [12]. To avoid the indeterminacy of word selection, keywords estimation was introduced as auxiliary information [13,14]. Koizumi et al [15] adopted a large pre-trained language model GPT-2 and audio-based similar caption…”

Section: Related Work (mentioning)

confidence: 99%

Diverse Audio Captioning via Adversarial Training

Mei, Liu, Sun et al. 2021

Preprint


Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips. To address this problem, we propose an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions. Unlike processing data of continuous values in a classical GAN, a sentence is composed of discrete tokens and the discrete sampling process is non-differentiable. To address this issue, policy gradient, a reinforcement learning technique, is used to back-propagate the reward to the generator. The results show that our proposed model can generate more diverse captions, as compared to state-of-the-art methods.


“…To address this limitation, Transformer with an attention mechanism is introduced to model the global information within an audio signal and to capture temporal relationships between audio events, such as in [5], where a Transformer encoder is applied to estimate the keyword vectors from the audio embedding, and a Transformer decoder is used to predict the captions based on the keyword vectors and word embedding. Another encoder-decoder architecture based on Transformer is presented in [6], which directly extracts audio features rather than the keywords, using pretrained convolutional neural networks (CNNs) such as the pretrained audio neural networks (PANNs) [7].…”

Section: Introduction (mentioning)

confidence: 99%

Local Information Assisted Attention-free Decoder for Audio Captioning

Xiao, Guan, Lan et al. 2022

Preprint


Automated audio captioning (AAC) aims to describe audio data with captions using natural language. Most existing AAC methods adopt an encoder-decoder structure, where an attention-based mechanism is a popular choice in the decoder (e.g., Transformer decoder) for predicting captions from audio features. Such attention-based decoders can capture the global information from the audio features; however, their ability to extract local information can be limited, which may lead to degraded quality in the generated captions. In this paper, we present an AAC method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction, and the attention-free decoder is designed to introduce local information. The proposed method enables the effective use of both global and local information from audio signals. Experiments show that our method outperforms the state-of-the-art methods with the standard attention-based decoder in Task 6 of the DCASE 2021 Challenge.
