ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682377

Audio Caption: Listen and Tell

Abstract: A growing amount of research has shed light on machine perception of audio events, most of which concerns detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying audio sounds, but also summarizing the relationships between different audio events. Comparable research, such as image captioning, has been conducted, yet the audio field is still quite barren. This paper introduces a manually annotated dataset for audio captioning. The purpose is to auto…


Cited by 48 publications (48 citation statements) · References 23 publications
Order By: Relevance
“…Recently, two different datasets for audio captioning were presented, Audio Caption and AudioCaps [6,7]. Audio Caption is partially released and contains 3,710 domain-specific (hospital) video clips with their audio tracks, along with annotations that were originally obtained in Mandarin Chinese and afterwards translated to English using machine translation [6]. The annotators had access to and viewed the videos.…”
Section: Introduction
confidence: 99%
“…The authors of [128] presented an unsupervised image captioning framework based on a new alignment method that allows the simultaneous integration of visual and textual streams through semantic learning of multimodal embeddings of the language and vision domains. Moreover, a multimodal model can also aggregate motion information [174], acoustic information [175], temporal information [176], etc. from successive frames to assign a caption for each one.…”
Section: Image Captioning
confidence: 99%
“…is the task of automatically generating a human-like, free-text description of the content of an audio signal. Recent progress has focused on the development of caption datasets [2,3,4], on which novel algorithms [5,6,7] are cultivated and rapidly developed. However, little attention has been paid to automatic evaluation metrics.…”
Section: Automated Audio Captioning
confidence: 99%
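
The statement above points to the gap in automatic evaluation metrics for audio captioning. As a minimal, hypothetical illustration (not drawn from the cited papers), the Python sketch below scores a generated caption against human references with BLEU via NLTK, an n-gram overlap metric commonly borrowed from machine translation; the captions themselves are invented examples.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented example captions; real caption datasets pair each audio clip
# with several human-written reference sentences.
references = [
    "a man speaks while a machine beeps in the background".split(),
    "someone talks over repeated beeping sounds".split(),
]
hypothesis = "a man is talking and a machine beeps".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")

Overlap-based scores like this one reward exact word matches rather than semantic adequacy, which is precisely the limitation motivating work on captioning-specific evaluation metrics.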