International Conference on Content-Based Multimedia Indexing 2022
DOI: 10.1145/3549555.3549568

An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification

Abstract: In this paper, we propose a deep-learning-based system for the task of deepfake audio detection. In particular, the raw input audio is first transformed into various spectrograms using three transformation methods, Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT), combined with different auditory-based filters: Mel, Gammatone, linear filters (LF), and discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on thre…
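As an illustration of the first transformation named in the abstract, the following is a minimal, dependency-free sketch of an STFT magnitude spectrogram. It is not the authors' implementation: the function names `hann` and `stft_magnitude`, the frame length, the hop size, the window choice, and the synthetic test tone are all assumptions made for the example, and the filter banks (Mel, Gammatone, LF, DCT) are not covered.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes.

    A naive O(N^2) DFT is applied to every windowed frame; only the
    non-negative frequency bins (0 .. frame_len // 2) are kept.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = [
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames


# Hypothetical input: a pure 128 Hz tone sampled at 1024 Hz.
sample_rate = 1024
tone_hz = 128
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = stft_magnitude(signal)
# Each bin spans sample_rate / frame_len = 16 Hz, so the tone falls in bin 8.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

In practice an FFT-based routine from a numerical library would replace the naive DFT loop; the sketch only shows the framing, windowing, and magnitude steps.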

Cited by 6 publications (3 citation statements)
References 18 publications
“…Rather than selecting a single task, we evaluate over a wide range of datasets of: Crowded-Scene [32], DCASE 2018 Task 1A and 1B [33], DCASE 2019 Task 1A and 1B [34], DCASE 2020 Task 1A [35], DCASE 2021 Task 1A [35], and DCASE 2022 Tasks 1 [35]. We will see that the performance of our proposed system is competitive with the state-of-the-art systems.…”
Section: (not given); citation type: mentioning
confidence: 99%
“…Crowded Scenes (Cr-Sc) [32]: contains 341 videos collected from YouTube (in-the-wild scenes), which presents a total recording time of nearly 29.06 hours. These videos were then split into 10-second video segments, each of which was annotated by one of five categories: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', or 'Sport-Atmosphere'.…”
Section: Evaluating Datasets; citation type: mentioning
confidence: 99%
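The segmentation described in the statement above can be sketched as follows. This is a rough illustration, not the dataset authors' tooling: the helper `segment_bounds` is hypothetical, and dropping a trailing partial segment is an assumption. Because each of the 341 videos may leave its own remainder, the figure derived from the total recording time is only an upper bound on the number of 10-second segments.

```python
def segment_bounds(duration_s, segment_s=10.0):
    """Return (start, end) times in seconds of the full fixed-length segments.

    Assumption: a trailing partial segment shorter than segment_s is dropped.
    """
    count = int(duration_s // segment_s)
    return [(i * segment_s, (i + 1) * segment_s) for i in range(count)]


# Hypothetical 35-second video: three full segments, 5 seconds discarded.
example = segment_bounds(35.0)

# Upper bound on segment count from the reported total of 29.06 hours.
total_hours = 29.06
upper_bound = int(total_hours * 3600 // 10)
```

Here `upper_bound` evaluates to 10461, reached only if no video leaves a remainder when divided into 10-second pieces.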