ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9052990

Clotho: an Audio Captioning Dataset

Abstract: Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with…

Citations: cited by 173 publications (139 citation statements)
References: 11 publications
“…[49–56, 60, 61]. Furthermore, recent audio and audiovisual captioning trends can offer additional semantic conceptualization meta-data [62–65]. These meta-information augmentation perspectives can accompany the above-discussed sustainable growth and well-being indicators, suggesting added-value innovative services for soundscape preservation and their engaging promotion at environmental, ecological, and heritage views.…”
Section: Related Work (mentioning)
confidence: 99%
“…The proposed modular architecture allows the attachment of multi-channeled ambisonics sensors to the client terminal (i.e., soundfield microphones), to apply more sophisticated spatiotemporal localization and mapping that could facilitate the audiovisual content description and management [49–51, 74, 75]. On the other side, more demanding semantic analysis can be performed on a batch processing mode, as a cloud service, making use of recent advantages on Convolutional Neural Networks (CNN), Deep Learning (DL), and multimodal decision-making systems [58–65]. The focus here lies in the discrimination of time-concurrent audio events in a hierarchical classification taxonomy.…”
Section: Integration Of State-of-the-art Audio and Soundscape Semanti… (mentioning)
confidence: 99%
“…their corresponding physical properties, temporal information of these sound events, and their relationship with other events, and high-level knowledge-rich auditory understanding. For instance, a typical caption from the DCASE benchmark dataset Clotho [7] "people talking in a small and empty room" describes the sound event "people talking" and its global scene "in a room", where high-level auditory knowledge is processed to infer that the room is small and empty, a visual description.…”
Section: Introduction (mentioning)
confidence: 99%
“…Audio classification is a well-studied research field [1–5] with a wide variety of applications such as multimedia search and retrieval [4], urban sound monitoring [6], bioacoustic monitoring [7], and audio captioning [8]. Most recent audio classification methods employ a standard supervised learning approach applied to deep neural networks.…”
Section: Introduction (mentioning)
confidence: 99%