2021
DOI: 10.48550/arxiv.2110.06100
Preprint
Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Cited by 2 publications (5 citation statements)
References 14 publications
“…Mei et al [19] proposed a full Transformer-based audio captioning method to improve the capability of modelling global and fine-grained temporal information. Ye et al [20] proposed a fully supervised audio captioning model based on the multi-modal attention module, which utilizes acoustic and semantic information to generate captions. Xu et al [21] pre-trained the audio encoder on text-audio retrieval tasks, enhancing the representation capability of the audio encoder for audio captioning.…”
Section: B. Fully Supervised Audio Captioning (mentioning)
confidence: 99%
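The multi-modal attention idea attributed to [20] can be illustrated with a short sketch: at each decoding step the decoder state attends separately to acoustic features and to semantic (e.g., keyword) embeddings, and the two context vectors are fused. The PyTorch module below is a hypothetical minimal version for illustration only; the class name, head count, and dimensions are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiModalAttentionFusion(nn.Module):
    """Hypothetical sketch of multi-modal attention: the decoder state
    attends to acoustic and semantic feature sequences separately, then
    the two context vectors are fused. Not the implementation from [20]."""

    def __init__(self, d_model: int):
        super().__init__()
        self.acoustic_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.semantic_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, dec_state, acoustic_feats, semantic_feats):
        # dec_state: (B, T_dec, D); acoustic_feats: (B, T_a, D); semantic_feats: (B, T_s, D)
        a_ctx, _ = self.acoustic_attn(dec_state, acoustic_feats, acoustic_feats)
        s_ctx, _ = self.semantic_attn(dec_state, semantic_feats, semantic_feats)
        # Concatenate both context vectors and project back to the model width.
        return self.fuse(torch.cat([a_ctx, s_ctx], dim=-1))

# Usage with random tensors:
fusion = MultiModalAttentionFusion(d_model=256)
out = fusion(torch.randn(2, 10, 256), torch.randn(2, 50, 256), torch.randn(2, 5, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```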
“…Fully Supervised Audio Captioning: We compare our method with fully supervised audio captioning methods: ACT [19], MAAC [20], Xu et al [21], Prefix AAC [22], RLSSR [23], RECAP [24], and ACTUAL [7], all of which are open source and not trained with additional data.…”
Section: Baselines (mentioning)
confidence: 99%
“…CLIP-AAC [27] adds a text encoder to the conventional encoder-decoder architecture for contrastive learning. MAAC [6] employs an LSTM-based multimodal attention decoder to incorporate both the acoustic and the semantic information. P-LocalAFT [7] employs CNN-10 from PANN [28] as the encoder and a local information assisted attention-free transformer as the decoder.…”
Section: Setup (mentioning)
confidence: 99%
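For context on the contrastive learning mentioned for CLIP-AAC [27], CLIP-style training typically uses a symmetric InfoNCE loss over paired audio and caption embeddings. The snippet below is a generic sketch assuming precomputed, same-width embeddings; the function name and temperature value are assumptions, not code from [27].

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/caption embeddings.
    Generic sketch of a CLIP-style objective, not code from [27]."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Match each audio clip to its own caption, and each caption to its clip.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```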
“…In order to address this challenge, prior works have extensively used the transfer learning framework. Some works employed audio tagging or sound event detection as the pretraining task for the audio encoder [5,6,7,8,9]. Others utilized pretrained language models such as GPT-2 [10,11,12,13] and BART [14,15] as the text decoder to enhance the semantic quality of the captions.…”
Section: Introduction (mentioning)
confidence: 99%
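As an illustration of the decoder-side transfer described above, a pretrained language model such as GPT-2 can be conditioned on audio features by projecting them into the model's embedding space and prepending them as a soft prefix, in the spirit of the prefix-based approaches [10-13]. This is a hedged sketch with hypothetical shapes and variable names; it is not any cited paper's exact pipeline.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch: project audio-encoder features to GPT-2's embedding width and
# prepend them as a soft prefix. Shapes are assumed for illustration.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

audio_feats = torch.randn(1, 32, 768)        # (B, T_audio, D_audio), assumed encoder output
project = nn.Linear(768, lm.config.n_embd)   # map audio features to GPT-2 width
prefix = project(audio_feats)                # (B, T_audio, n_embd)

caption_ids = tokenizer("a dog barks twice", return_tensors="pt").input_ids
caption_emb = lm.transformer.wte(caption_ids)  # token embeddings (B, T_cap, n_embd)

inputs = torch.cat([prefix, caption_emb], dim=1)
# Label prefix positions with -100 so they are ignored by the loss.
labels = torch.cat(
    [torch.full(prefix.shape[:2], -100, dtype=torch.long), caption_ids], dim=1
)
out = lm(inputs_embeds=inputs, labels=labels)
out.loss.backward()  # fine-tune the decoder on caption tokens only
```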