PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

Kong, Qiuqiang; Cao, Yin; Iqbal, Turab; Wang, Yu-Xuan; Wang, Wenwu; Plumbley, Mark D.

doi:10.48550/arxiv.1912.10211

Cited by 24 publications

(46 citation statements)

References 42 publications


“…b. Audio Modality: Following the aforementioned procedure of our visual modality comparison, Table 4 presents the results of using pre-trained audio features [25] compared with audio features learned using ShotCoL on Ad-Cuepoints dataset. Results using different number of shots are presented showing that ShotCoL is able to outperform existing approach by a sizable margin, demonstrating its effectiveness on audio modality.…”

Section: Results (mentioning)

confidence: 99%

“…To extract the audio embedding from each shot, we use a Wavegram-Logmel CNN [25] which incorporates a 14-layer CNN similar in architecture to the VGG [17] network. We sample 10-second mono audio samples at a rate of 32 kHz from each shot.…”

Section: Audio Modality (mentioning)

confidence: 99%

“…For shots longer than 10 seconds, we extract a 10-second window from the center. These inputs are provided to the Wavegram-Logmel network [25] to extract a 2048-dimensional feature vector for each shot.…”

Section: Audio Modality (mentioning)

confidence: 99%
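The windowing step described in the statements above (10-second mono clips at 32 kHz, taken from the center of longer shots) can be sketched roughly as follows. This is an illustrative sketch only, not the citing authors' code: the function name `center_window` is hypothetical, and the handling of shots shorter than 10 seconds is an assumption, since the quoted text does not describe it.

```python
def center_window(samples, sample_rate=32000, window_seconds=10.0):
    """Return the centered window of at most `window_seconds` from a mono signal.

    `samples` is a sequence of audio samples already converted to mono.
    """
    window_len = int(round(sample_rate * window_seconds))
    if len(samples) <= window_len:
        # Assumption: shorter shots are returned unchanged; the quoted
        # statements do not say how short shots are handled.
        return list(samples)
    # Drop an equal amount from both ends so the window sits in the middle.
    start = (len(samples) - window_len) // 2
    return list(samples[start:start + window_len])


# Usage: a 15-second stand-in signal at 32 kHz is cut to its middle 10 seconds.
shot = list(range(15 * 32000))
clip = center_window(shot)
```

The resulting clip would then be passed to the pretrained audio network to obtain the per-shot embedding; that step is omitted here because it depends on the specific model release.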


Shot Contrastive Self-Supervised Learning for Scene Boundary Detection

Chen, Nie, Fan et al. 2021

Preprint


Approach Overview: Representative frames of 10 shots from 2 different scenes of the movie Stuart Little are shown. The story arc of each scene is distinguishable and semantically coherent. We consider similar nearby shots (e.g. 5 and 3) as augmented versions of each other. This augmentation approach is able to capitalize on the underlying film-production process and can encode the scene structure better than the existing augmentation methods. Given a current shot (query), we find a similar shot (key) within its neighborhood and: (a) maximize the similarity between the query and the key, and (b) minimize the similarity of the query with randomly selected shots.
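The query/key objective described above can be illustrated with a small sketch. The helper names (`cosine`, `pick_key`, `contrastive_loss`), the cosine similarity measure, the temperature value, and the InfoNCE-style softmax loss are all assumptions chosen for illustration; the overview does not specify the exact loss, so this should not be read as the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_key(query, neighbors):
    """Index of the neighboring shot embedding most similar to the query."""
    return max(range(len(neighbors)), key=lambda i: cosine(query, neighbors[i]))


def contrastive_loss(query, key, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed choice): low when the query is close
    to the key and far from the randomly selected negatives."""
    logits = [cosine(query, key) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and negative logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Usage with toy 2-D embeddings: the second neighbor is chosen as the key.
query = [1.0, 0.0]
neighbors = [[0.0, 1.0], [1.0, 0.1]]
key = neighbors[pick_key(query, neighbors)]
loss = contrastive_loss(query, key, negatives=[[0.0, 1.0], [-1.0, 0.0]])
```

Minimizing such a loss pushes the query toward its selected key, matching point (a), and away from the random shots, matching point (b).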


“…We use R(2+1)D-18 [58] as the video student network. The audio and image teacher networks are the 1D-CNN14 [35] and 2D-ResNet34 [28], pretrained on the AudioSet [22] and ImageNet [15]. The model weights of the teacher networks are kept frozen during training.…”

Section: Methods (mentioning)

confidence: 99%

Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

Chen, Xian, Koepke et al. 2021

Preprint


Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these data modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image, and video representations across modalities to uncover richer multi-modal knowledge. Our main idea is to learn a compositional embedding that closes the cross-modal semantic gap and captures the task-relevant semantics, which facilitates pulling together representations across modalities by compositional contrastive learning. We establish a new, comprehensive multi-modal distillation benchmark on three video datasets: UCF101, ActivityNet, and VGGSound. Moreover, we demonstrate that our model significantly outperforms a variety of existing knowledge distillation methods in transferring audio-visual knowledge to improve video representation learning. Code is released here: https://github.com/yanbeic/CCL.


“…[5] and Lasseck [6,7] introduced deep learning techniques for the "Bird species identification in soundscapes" problem. State-of-the-art solutions are based on Deep Convolutional Neural Networks (CNNs) [8,9,10], usually, deep CNNs with attention mechanisms are selected as backbone in these experiments [11,12,13,14,15]. Pretrained audio neural networks (PANNs) [14] provide a multi-task state-of-the-art baseline for audio related tasks, in previous competitions these networks proved their generalization capability.…”

Section: Related Work (mentioning)

confidence: 99%

Weakly-Supervised Classification and Detection of Bird Sounds in the Wild. A BirdCLEF 2021 Solution

Conde, Shubham, Agnihotri et al. 2021

Preprint


It is easier to hear birds than to see them. However, they still play an essential role in nature, and they are excellent indicators of deteriorating environmental quality and pollution. Recent advances in Machine Learning and Convolutional Neural Networks allow us to detect and classify bird sounds; by doing this, we can assist researchers in monitoring the status and trends of bird populations and biodiversity in ecosystems. We propose a sound detection and classification pipeline for analyzing complex soundscape recordings and identifying birdcalls in the background. Our pipeline learns from weak labels, classifies fine-grained bird vocalizations in the wild, and is robust against background sounds (e.g., airplanes, rain, etc.). Our solution achieved 10th place of 816 teams at the BirdCLEF 2021 Challenge hosted on Kaggle. Code and models will be open-sourced at https://github.com/kumar-shubham-ml/kaggle-birdclef-2021.

