2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
DOI: 10.1109/waspaa.2017.8170058
Automated audio captioning with recurrent neural networks

Abstract: We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder is a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model a…

Cited by 83 publications (91 citation statements)
References 13 publications
“…The encoder is a series of bi-directional gated recurrent units (bi-GRUs) [11], similarly to [4]. The output dimensionality for the GRU layers (forward and backward GRUs have same dimensionality) is {256, 256, 256}.…”
Section: Data Splitting
confidence: 99%
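The citation above describes an encoder built from stacked bi-directional GRUs with 256-dimensional forward and backward states. As a minimal NumPy sketch (not the authors' implementation — weights here are random placeholders and only one of the three layers is shown), one bi-GRU layer over a sequence of mel-band frames looks like:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    # Standard GRU recurrence: update gate z, reset gate r, candidate state.
    z = sigmoid(x @ Wz + h @ Uz)
    r = sigmoid(x @ Wr + h @ Ur)
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1.0 - z) * h + z * h_tilde

def bi_gru(X, dim=256, seed=0):
    """One bi-directional GRU layer over X of shape (T, n_features)."""
    rng = np.random.default_rng(seed)
    T, n = X.shape

    def run(backward):
        # Random placeholder weights; a trained encoder would learn these.
        Wz, Wr, Wh = (0.01 * rng.standard_normal((n, dim)) for _ in range(3))
        Uz, Ur, Uh = (0.01 * rng.standard_normal((dim, dim)) for _ in range(3))
        h = np.zeros(dim)
        outs = []
        for t in (reversed(range(T)) if backward else range(T)):
            h = gru_step(X[t], h, Wz, Uz, Wr, Ur, Wh, Uh)
            outs.append(h)
        if backward:
            outs.reverse()  # re-align backward states with time order
        return np.stack(outs)

    # Concatenate forward and backward states: (T, 2 * dim).
    return np.concatenate([run(False), run(True)], axis=1)

X = np.random.default_rng(1).standard_normal((10, 64))  # T = 10 frames, 64 bands
H = bi_gru(X, dim=256)
print(H.shape)  # (10, 512)
```

Stacking three such layers, each with per-direction dimensionality 256, gives the {256, 256, 256} configuration the citing paper reports.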
“…Baseline method and evaluation: In order to provide an example of how to employ Clotho and some initial (baseline) results, we use a previously utilized method for audio captioning [4] which is based on an encoder-decoder scheme with attention. The method accepts as an input a length-T sequence of 64 log mel-band energies X ∈ R^{T×64}, which is used as an input to a DNN which outputs a probability distribution of words.…”
confidence: 99%
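The input described in this citation, a length-T sequence of 64 log mel-band energies X ∈ R^{T×64}, can be sketched end-to-end in plain NumPy. The frame length, hop size, and sample rate below are assumptions for illustration (the excerpt does not state the exact analysis settings):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_energies(y, sr=44100, n_fft=1024, hop=512, n_mels=64):
    """Return X of shape (T, n_mels): log mel-band energies of signal y."""
    # Frame the signal with a Hann window and take the power spectrum.
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    S = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2  # (T, n_fft//2+1)

    # Build a triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)

    # Apply the filterbank and take the log (small floor avoids log(0)).
    return np.log(S @ fb.T + 1e-10)

y = np.sin(2 * np.pi * 440.0 * np.arange(44100) / 44100)  # 1 s test tone
X = log_mel_energies(y)
print(X.shape)  # (85, 64)
```

Each row of X is one analysis frame; the sequence of rows is what the encoder-decoder consumes as its length-T input.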
“…For example, acoustic monitoring could detect physical events, such as glass breaking, a gun firing, tires skidding, or a car crashing. SED can also be incorporated into audio captioning for understanding social media content in more detail [2], audio monitoring in smart cities [3], life assistance and healthcare [4], etc.…”
Section: Introduction
confidence: 99%
“…To achieve human-like perception, using natural language to describe images ([6,7,8]) and videos has attracted much attention ([9,10,11]). Yet only little research has been made regarding audio scenes [12], which we think is due to the difference between visual Heinrich Dinkel is the co-first author. Kai Yu and Mengyue Wu are the corresponding authors.…”
Section: Introduction
confidence: 99%