Interspeech 2020
DOI: 10.21437/interspeech.2020-2087

A Transformer-Based Audio Captioning Model with Keyword Estimation

Abstract: One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation called TRACKE. It simultaneously solves the word-selection indeterminacy problem with the main task of AAC while executi…
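The abstract outlines TRACKE's two coupled tasks: estimating keywords from the audio as an auxiliary task, and generating the caption conditioned on them. The sketch below is a minimal, hypothetical PyTorch illustration of that overall shape; the class name `KeywordCaptioner`, the layer counts, and all dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch.nn as nn

class KeywordCaptioner(nn.Module):
    """Illustrative TRACKE-style model (hypothetical, not the authors' code):
    a Transformer encoder reads audio features, a keyword head solves the
    auxiliary multi-label keyword-estimation task, and a Transformer decoder
    generates the caption conditioned on the encoder output."""

    def __init__(self, feat_dim=64, d_model=256, n_keywords=300, vocab_size=5000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)            # audio frames -> model dim
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.keyword_head = nn.Linear(d_model, n_keywords)  # multi-label keyword logits
        self.embed = nn.Embedding(vocab_size, d_model)
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)           # caption token logits

    def forward(self, audio_feats, caption_in):
        # audio_feats: (B, T, feat_dim) log-mel frames; caption_in: (B, L) token ids
        h = self.encoder(self.proj(audio_feats))            # (B, T, d_model)
        kw_logits = self.keyword_head(h.mean(dim=1))        # temporal mean pool -> keywords
        mask = nn.Transformer.generate_square_subsequent_mask(caption_in.size(1))
        dec = self.decoder(self.embed(caption_in), h, tgt_mask=mask)
        return self.out(dec), kw_logits                     # caption + keyword predictions
```

During training, such a model would typically combine a cross-entropy loss on the caption logits with a binary cross-entropy loss on multi-hot keyword targets, so that the auxiliary keyword task constrains word selection for the main captioning task.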

Cited by 47 publications (46 citation statements) | References 27 publications
“…Most of these works also make use of attention mechanisms to align the audio and text modalities [14], [15], [18], [21]. More recently, following the success of self-attention in V&L models, a small body of work has also started exploring the use of Transformer-based models in audio captioning [19], [23].…”
Section: B. Audio and Language
confidence: 99%
“…CNN-RNN [10] and CNN-Transformer [11] are the two dominant architectures which achieve state-of-the-art performance, while a Transformer-only network also shows competitive performance [12]. To avoid the indeterminacy of word selection, keyword estimation was introduced as auxiliary information [13,14]. Koizumi et al. [15] adopted the large pre-trained language model GPT-2 and audio-based similar caption…”
Section: Related Work
confidence: 99%
“…To address this limitation, the Transformer with an attention mechanism is introduced to model the global information within an audio signal and to capture temporal relationships between audio events, such as in [5], where a Transformer encoder is applied to estimate the keyword vectors from the audio embedding, and a Transformer decoder is used to predict the captions based on the keyword vectors and word embedding. Another encoder-decoder architecture based on the Transformer is presented in [6], which directly extracts audio features rather than keywords, using pretrained convolutional neural networks (CNNs) such as the pretrained audio neural networks (PANNs) [7].…”
Section: Introduction
confidence: 99%
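The statement above contrasts two encoder-decoder designs: one in which the encoder estimates keyword vectors for the decoder [5], and one in which a pretrained CNN such as PANNs supplies frame-level audio features directly [6, 7]. A minimal PyTorch sketch of the second shape follows; the shallow convolutional frontend merely stands in for a real pretrained network like PANNs (the actual models are far deeper), and every name and size here is an illustrative assumption.

```python
import torch.nn as nn

class CnnTransformerCaptioner(nn.Module):
    """Hypothetical sketch of a CNN-Transformer captioner: a convolutional
    frontend turns a log-mel spectrogram into frame-level embeddings, which
    a Transformer decoder attends over to generate the caption."""

    def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
        super().__init__()
        self.cnn = nn.Sequential(                      # toy stand-in for a PANN-like CNN
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, caption_in):
        # mel: (B, T, n_mels) log-mel frames; caption_in: (B, L) token ids
        x = self.cnn(mel.unsqueeze(1))                 # (B, C, T, n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)           # (B, T, C * n_mels // 4)
        memory = self.proj(x)                          # frame-level audio embeddings
        mask = nn.Transformer.generate_square_subsequent_mask(caption_in.size(1))
        dec = self.decoder(self.embed(caption_in), memory, tgt_mask=mask)
        return self.out(dec)                           # (B, L, vocab_size) caption logits
```

In practice the frontend would be initialized from pretrained PANN weights and is often frozen or fine-tuned at a lower learning rate than the decoder.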