Crowdsourcing a Dataset of Audio Captions

Lipping, Samuel; Drossos, Konstantinos; Virtanen, Tuomas

doi:10.48550/arxiv.1907.09238

Cited by 1 publication

(2 citation statements)

References 13 publications

“…The audio captioning task is firstly introduced in [1], which proposed the commercial ProSound Effects [6] audio corpus as a proof of concept. The paper proposed a BiGRU [7] based encoder-decoder model to generate audio captions.…”

Section: Related Work (mentioning)

confidence: 99%

“…Even for people, precisely distinguishing events in audio can be difficult, let alone effectively describing the contents of given audio, because the description is often dependent on the situation or context as much as the audio itself. Therefore, due to the ambiguity of audio, different persons may have varying perceptions of the same audio, which will result in the semantic disparity of audio captions [2], for example, a thin plastic rattling could be perceived as a fire crackling [6] (as shown in Fig. 1).…”

Section: Introduction (mentioning)

confidence: 99%

Caption Feature Space Regularization for Audio Captioning

Zhang, Yang, Du et al. 2022

Preprint

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio, different people may perceive the same audio differently, resulting in caption disparities (i.e., one audio may correlate to several captions with diverse semantics). For that, general audio captioning models achieve the one-to-many training by randomly selecting a correlated caption as the ground truth for each audio. However, it leads to a significant variation in the optimization directions and weakens the model stability. To eliminate this negative effect, in this paper, we propose a two-stage framework for audio captioning: (i) in the first stage, via the contrastive learning, we construct a proxy feature space to reduce the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to encourage the model to be optimized in the direction that benefits all the correlated captions. We conducted extensive experiments on two datasets using four commonly used encoder and decoder architectures. Experimental results demonstrate the effectiveness of the proposed method. The code is available at https://github.com/PRIS-CV/Caption-Feature-Space-Regularization.
