2022 30th European Signal Processing Conference (EUSIPCO)
DOI: 10.23919/eusipco55093.2022.9909680
Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering

Cited by 14 publications
(4 citation statements)
References 17 publications
“…Image QA datasets: a recent survey on visual QA and visual reasoning [82] provides a full list of image/visual question-answering (VQA) datasets, including reasoning tasks. Audio QA datasets: DAQA [83] for audio temporal reasoning, Clotho-AQA [84] for binary and multiple-choice audio QA. Video QA datasets: VideoQA [85] for multi-domain QA, MovieQA [86]/MovieFIB [87]/TVQA [88]/KnowIT VQA [89] for movies and shows, MarioQA [90] for games, PororoQA [91] for cartoons, TutorialVQA [92] for tutorials, and CLEVRER [93] for physical & causal reasoning.…”
Section: Datasets
confidence: 99%
“…The second stage of audio-language training aims to enhance the versatility and applicability of our model through multi-task learning, involving a variety of datasets tailored to different audio processing tasks. ClothoAQA (Lipping et al., 2022b), with about 1,500 entries, is utilized for refining our model's capabilities in question answering based on audio cues, with each sample enriched by six associated questions sourced through crowdsourcing. The instruction tuning phase also capitalizes on the continued use of WavCaps (Mei et al., 2023), alongside AudioCaps (Kim et al., 2019) and Clotho (Drossos et al., 2020), which together contribute about 453,000 audio samples for audio-text training.…”
Section: Multi-task Fine-tuning
confidence: 99%
“…AQA: For the AQA task, we select the open-ended Clotho-AQA (Lipping et al., 2022a) and multiple-choice TUT2017 (Mesaros et al., 2016) datasets as our benchmarks and report accuracy as the evaluation metric.…”
Section: Evaluation Benchmarks
confidence: 99%
“…More recently, there has been a surge of interest in establishing a more profound comprehension of audio content by connecting audio and language. A number of audio-language (AL) multimodal learning tasks have been introduced, such as text-to-audio retrieval [8], [9], automated audio captioning [10]-[12], audio question answering [13], [14], text-based sound generation [15]-[18], and text-based sound separation [19].…”
Section: Introduction
confidence: 99%