2019
DOI: 10.48550/arxiv.1912.10211
Preprint

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

Abstract: Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification and sound event detection. Recently, neural networks have been applied to solve audio pattern recognition problems. However, previous systems focus on small datasets, which limits the performance of audio pattern recognition systems. Recently, in computer vision and natural language processing, systems pretrained on large datasets have generalized …

Cited by 24 publications (46 citation statements)
References 42 publications
“…b. Audio Modality: Following the aforementioned procedure of our visual modality comparison, Table 4 presents the results of using pre-trained audio features [25] compared with audio features learned using ShotCoL on Ad-Cuepoints dataset. Results using different number of shots are presented showing that ShotCoL is able to outperform existing approach by a sizable margin, demonstrating its effectiveness on audio modality.…”
Section: Results (mentioning)
confidence: 99%
“…To extract the audio embedding from each shot, we use a Wavegram-Logmel CNN [25] which incorporates a 14-layer CNN similar in architecture to the VGG [17] network. We sample 10-second mono audio samples at a rate of 32 kHz from each shot.…”
Section: Audio Modality (mentioning)
confidence: 99%
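The clip preparation described in the statement above (10-second mono audio at 32 kHz) can be sketched roughly as follows. This is only an illustration, not the citing authors' code: the quoted passage does not say how stereo audio is downmixed or how shots shorter than 10 seconds are handled, so channel averaging and zero-padding are assumptions here, and the helper names `to_mono`, `fix_length`, and `prepare_clip` are hypothetical. The input is assumed to be already resampled to 32 kHz.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 32_000  # sampling rate stated in the quoted passage
CLIP_SECONDS = 10        # clip length stated in the quoted passage
CLIP_SAMPLES = SAMPLE_RATE_HZ * CLIP_SECONDS  # samples per fixed-length clip


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each frame (assumed downmix strategy)."""
    return [sum(frame) / len(frame) for frame in frames]


def fix_length(samples: Sequence[float], target: int = CLIP_SAMPLES) -> List[float]:
    """Truncate or zero-pad (assumed padding strategy) to exactly `target` samples."""
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def prepare_clip(frames: Sequence[Sequence[float]]) -> List[float]:
    """Turn multi-channel frames, assumed to be at 32 kHz, into a 10-second mono clip."""
    return fix_length(to_mono(frames))
```

A fixed clip length gives every shot the same input shape, which keeps batching simple when a pretrained network is used as a feature extractor. Libraries such as librosa or torchaudio would normally handle loading and resampling before a step like this.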
“…We use R(2+1)D-18 [58] as the video student network. The audio and image teacher networks are the 1D-CNN14 [35] and 2D-ResNet34 [28], pretrained on the AudioSet [22] and ImageNet [15]. The model weights of the teacher networks are kept frozen during training.…”
Section: Methods (mentioning)
confidence: 99%
“…[5] and Lasseck [6,7] introduced deep learning techniques for the "Bird species identification in soundscapes" problem. State-of-the-art solutions are based on Deep Convolutional Neural Networks (CNNs) [8,9,10], usually, deep CNNs with attention mechanisms are selected as backbone in these experiments [11,12,13,14,15]. Pretrained audio neural networks (PANNs) [14] provide a multi-task state-of-the-art baseline for audio related tasks, in previous competitions these networks proved their generalization capability.…”
Section: Related Work (mentioning)
confidence: 99%