2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
DOI: 10.1109/waspaa.2017.8170058
Automated audio captioning with recurrent neural networks

Abstract: We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder is a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model a…

Cited by 83 publications (91 citation statements)
References 13 publications
“…The encoder is a series of bi-directional gated recurrent units (bi-GRUs) [11], similarly to [4]. The output dimensionality for the GRU layers (forward and backward GRUs have same dimensionality) is {256, 256, 256}.…”
Section: Data Splitting
confidence: 99%
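The citation above describes an encoder built from stacked bi-directional GRUs with 256-dimensional forward and backward states. As a minimal NumPy sketch (not the authors' implementation — weights here are random placeholders and only one of the three layers is shown), one bi-GRU layer over a sequence of mel-band frames looks like:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    # Standard GRU recurrence: update gate z, reset gate r, candidate state.
    z = sigmoid(x @ Wz + h @ Uz)
    r = sigmoid(x @ Wr + h @ Ur)
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1.0 - z) * h + z * h_tilde

def bi_gru(X, dim=256, seed=0):
    """One bi-directional GRU layer over X of shape (T, n_features)."""
    rng = np.random.default_rng(seed)
    T, n = X.shape

    def run(backward):
        # Random placeholder weights; a trained encoder would learn these.
        Wz, Wr, Wh = (0.01 * rng.standard_normal((n, dim)) for _ in range(3))
        Uz, Ur, Uh = (0.01 * rng.standard_normal((dim, dim)) for _ in range(3))
        h = np.zeros(dim)
        outs = []
        for t in (reversed(range(T)) if backward else range(T)):
            h = gru_step(X[t], h, Wz, Uz, Wr, Ur, Wh, Uh)
            outs.append(h)
        if backward:
            outs.reverse()  # re-align backward states with time order
        return np.stack(outs)

    # Concatenate forward and backward states: (T, 2 * dim).
    return np.concatenate([run(False), run(True)], axis=1)

X = np.random.default_rng(1).standard_normal((10, 64))  # T = 10 frames, 64 bands
H = bi_gru(X, dim=256)
print(H.shape)  # (10, 512)
```

Stacking three such layers, each with per-direction dimensionality 256, gives the {256, 256, 256} configuration the citing paper reports.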
“…Baseline method and evaluation: In order to provide an example of how to employ Clotho and some initial (baseline) results, we use a previously utilized method for audio captioning [4] which is based on an encoder-decoder scheme with attention. The method accepts as an input a length-T sequence of 64 log mel-band energies X ∈ R^{T×64}, which is used as an input to a DNN which outputs a probability distribution of words.…”
confidence: 99%
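The input described in this citation, a length-T sequence of 64 log mel-band energies X ∈ R^{T×64}, can be sketched end-to-end in plain NumPy. The frame length, hop size, and sample rate below are assumptions for illustration (the excerpt does not state the exact analysis settings):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_energies(y, sr=44100, n_fft=1024, hop=512, n_mels=64):
    """Return X of shape (T, n_mels): log mel-band energies of signal y."""
    # Frame the signal with a Hann window and take the power spectrum.
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    S = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2  # (T, n_fft//2+1)

    # Build a triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)

    # Apply the filterbank and take the log (small floor avoids log(0)).
    return np.log(S @ fb.T + 1e-10)

y = np.sin(2 * np.pi * 440.0 * np.arange(44100) / 44100)  # 1 s test tone
X = log_mel_energies(y)
print(X.shape)  # (85, 64)
```

Each row of X is one analysis frame; the sequence of rows is what the encoder-decoder consumes as its length-T input.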
“…For example, acoustic monitoring could detect physical events, such as glass breaking, a gun firing, tires skidding, or a car crashing. SED can also be incorporated into audio captioning for understanding social media content in more detail [2], audio monitoring in smart cities [3], life assistance and healthcare [4], etc.…”
Section: Introduction
confidence: 99%
“…To achieve human-like perception, using natural language to describe images ([6,7,8]) and videos has attracted much attention ([9,10,11]). Yet only little research has been made regarding audio scenes [12], which we think is due to the difference between visual Heinrich Dinkel is the co-first author. Kai Yu and Mengyue Wu are the corresponding authors.…”
Section: Introduction
confidence: 99%