2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462200

Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes

Abstract: In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data. We first describe a convolutional neural network (CNN) based framework for sound event detection and classification using weakly labeled audio data. Our model trains efficiently from audio recordings of variable length; hence, it is well suited for transfer learning. We then propose methods to learn representations using this model which can be effectively used for solving the target task. We study both transductive…
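The abstract's central design point, training from audio of variable length, is commonly realized by predicting segment-level scores along the time axis and pooling them into one clip-level output. Below is a minimal PyTorch sketch of that idea, assuming log-mel input and a 527-class (AudioSet-sized) output; the layer sizes and the max-pooling choice are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class WeakLabelCNN(nn.Module):
    """Clip-level classifier trainable from weak labels on variable-length input."""
    def __init__(self, n_mels=128, n_classes=527):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 2)),                # frequency axis reduced 16x overall
        )
        # A convolution spanning the remaining frequency axis acts as a
        # per-segment classifier along time.
        self.classifier = nn.Conv2d(64, n_classes, kernel_size=(n_mels // 16, 1))

    def forward(self, x):                        # x: (batch, 1, n_mels, n_frames)
        h = self.features(x)
        seg = torch.sigmoid(self.classifier(h))  # (batch, n_classes, 1, time)
        return seg.amax(dim=(2, 3))              # pool over time -> clip scores

clip = torch.randn(2, 1, 128, 400)               # any frame count >= 4 works
print(WeakLabelCNN()(clip).shape)                # torch.Size([2, 527])
```

Because only the pooled clip-level score needs a label, the network trains from weak (clip-level) annotations while still producing segment-level scores usable for event detection.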

Cited by 113 publications (99 citation statements). References 19 publications (26 reference statements).
“…This shows the importance of high-level data augmentation for extracting more discriminating features. Table 5 compares the performance of the proposed classification approach to state-of-the-art pre-trained classifiers (AlexNet and GoogLeNet) on environmental sound datasets, following the transfer learning and fine-tuning strategies explained in [75]. It is worth mentioning that these two pre-trained networks have been fine-tuned on the 2D aggregation (pooling) of STFT, MFCC, and CRP.…”
Section: AI (mentioning)
confidence: 99%
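As a concrete illustration of the transfer-learning and fine-tuning strategy this statement refers to, the sketch below replaces the 1000-way head of an ImageNet-pretrained AlexNet and retrains it on 3-channel time-frequency "images" (e.g., STFT, MFCC, and CRP pooled to a fixed 2D size). The frozen-feature choice, class count, and hyperparameters are assumptions for illustration, not the exact recipe of [75].

```python
import torch
import torch.nn as nn
from torchvision import models

n_classes = 50                                   # e.g., ESC-50 environmental sounds
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.classifier[6] = nn.Linear(4096, n_classes)   # replace the 1000-way ImageNet head

# Freeze the convolutional features; fine-tune only the classifier head.
for p in net.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in net.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative step on dummy data shaped like pooled STFT/MFCC/CRP channels.
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, n_classes, (8,))
optimizer.zero_grad()
loss = criterion(net(x), y)
loss.backward()
optimizer.step()
```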
“…Network design: The backbone network is based on a simple yet powerful CNN structure that has been widely used in audio tasks [16,17]. The input feature of the network is a mel-spectrogram with 128 frequency bins and 160 frames.…”
Section: Experimental Settings (mentioning)
confidence: 99%
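For concreteness, here is how the input feature described above could be computed: a mel-spectrogram with 128 frequency bins, padded or cropped to 160 frames. The sample rate, FFT size, hop length, and log scaling are assumptions; the cited paper's exact settings may differ.

```python
import numpy as np
import librosa

def logmel(path, sr=32000, n_mels=128, n_frames=160, hop=320):
    """Load an audio file and return a (128, 160) log mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=hop, n_mels=n_mels)
    mel = librosa.power_to_db(mel)               # log scaling (assumed)
    if mel.shape[1] < n_frames:                  # zero-pad short clips ...
        mel = np.pad(mel, ((0, 0), (0, n_frames - mel.shape[1])))
    return mel[:, :n_frames]                     # ... and crop long ones
```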
“…Meanwhile, tagging sound events consumes large amounts of manpower. To address this problem, two kinds of approaches have been proposed: one makes efficient use of limited data [15][16][17][18][19], and the other relies on additional data [3,20,21]. [Figure 1: Architecture of the end-to-end audio classification network.]…”
Section: Related Work (mentioning)
confidence: 99%
“…However, these augmentation methods are not suitable for a multi-label dataset such as Audio Set. In the second type, Kumar et al. [21] pre-trained the model on an additional large audio dataset. Kong et al. [3] applied the rich sound representation learned on YouTube-100M [4] to classify Audio Set.…”
Section: Related Work (mentioning)
confidence: 99%
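Both citations describe the same transfer pattern: pre-train on a large, weakly labeled corpus, then reuse the learned representation on the target task. A minimal sketch of that pattern follows; `PretrainedAudioCNN` is a hypothetical stand-in with random weights, where a real pipeline would load checkpointed pre-trained weights instead.

```python
import torch
import torch.nn as nn

class PretrainedAudioCNN(nn.Module):
    """Hypothetical stand-in for a model pre-trained on a large audio corpus."""
    def __init__(self, emb_dim=1024):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.proj = nn.Linear(64, emb_dim)

    def embed(self, x):                          # x: (batch, 1, mels, frames)
        return self.proj(self.conv(x).flatten(1))

extractor = PretrainedAudioCNN().eval()          # would load real weights here
for p in extractor.parameters():                 # freeze the transferred network
    p.requires_grad = False

head = nn.Linear(1024, 10)                       # e.g., 10 acoustic scene classes
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(8, 1, 128, 160)                  # batch of mel-spectrogram inputs
y = torch.randint(0, 10, (8,))
with torch.no_grad():
    emb = extractor.embed(x)                     # fixed transferred embeddings
loss = nn.functional.cross_entropy(head(emb), y)
opt.zero_grad(); loss.backward(); opt.step()
```

Whether to fine-tune the transferred network end to end or keep it frozen, as here, is a task- and data-size-dependent design choice.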