Interspeech 2016
DOI: 10.21437/interspeech.2016-123

Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks

Abstract: We present in this paper a simple yet efficient convolutional neural network (CNN) architecture for robust audio event recognition. In contrast to deep CNN architectures with multiple convolutional and pooling layers topped with multiple fully connected layers, the proposed network consists of only three layers: a convolutional layer, a pooling layer, and a softmax layer. Two further features distinguish it from the deep architectures that have been proposed for the task: varying-size convolutional filters at the convolutional…
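The abstract describes a three-layer network: parallel convolutional filters of varying temporal width, a 1-max pooling stage, and a softmax output layer. Below is a minimal PyTorch sketch of that idea; the filter widths, filter count, and input dimensions are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a 1-max pooling CNN: parallel convolutions with varying temporal
# widths, global max pooling over time per filter, and a softmax classifier.
# All sizes below are assumptions for illustration.
import torch
import torch.nn as nn

class OneMaxPoolingCNN(nn.Module):
    def __init__(self, n_bands=40, n_classes=10,
                 filter_widths=(3, 5, 7), n_filters=100):
        super().__init__()
        # One Conv1d per filter width; frequency bands act as input channels,
        # so each filter spans all bands and a varying number of time frames.
        self.convs = nn.ModuleList(
            nn.Conv1d(n_bands, n_filters, kernel_size=w) for w in filter_widths
        )
        self.fc = nn.Linear(n_filters * len(filter_widths), n_classes)

    def forward(self, x):
        # x: (batch, n_bands, n_frames), e.g. a log-mel spectrogram
        feats = []
        for conv in self.convs:
            h = torch.relu(conv(x))      # (batch, n_filters, n_frames')
            h = torch.amax(h, dim=2)     # 1-max pooling: keep the strongest response over time
            feats.append(h)
        h = torch.cat(feats, dim=1)      # concatenate pooled features from all filter widths
        return self.fc(h)                # logits; apply softmax for class probabilities

logits = OneMaxPoolingCNN()(torch.randn(8, 40, 101))   # 8 clips, 40 bands, 101 frames
probs = torch.softmax(logits, dim=1)
```

The 1-max pooling over time is what makes the model tolerant to where an event occurs within a clip: only the strongest filter activation is kept, regardless of its temporal position.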

Cited by 81 publications (65 citation statements). References 15 publications.
“…2) Selecting SELDnet output format: The output format for polyphonic SED in the literature has become standardized to estimating the temporal activity of each sound class using frame-wise binary numbers [31][32][33][34]. On the other hand, the output formats for DOA estimation are still being experimented with as seen in Table I.…”
Section: Methods
confidence: 99%
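The statement above refers to the standard output format for polyphonic SED: frame-wise binary activity per sound class. As a purely illustrative sketch (not taken from the cited works), that format can be obtained by thresholding per-class sigmoid outputs:

```python
# Illustrative NumPy sketch of the frame-wise binary output format for
# polyphonic SED: per-class sigmoid activations thresholded into a
# (frames x classes) activity matrix. Values are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_classes = 6, 4
class_probs = rng.random((n_frames, n_classes))   # stand-in for network sigmoid outputs

activity = (class_probs > 0.5).astype(int)        # 1 = class active in that frame
print(activity)
# Each row is one time frame; several classes may be active at once (polyphony).
```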
“…Previous work on sound events has mostly focused on sound event classification, where audio clips consisting of sound events are classified. Apart from established classifiers, such as support vector machines [1], [3], deep learning methods such as deep belief networks [7], convolutional neural networks (CNN) [8], [9], [10] and recurrent neural networks (RNN) [4], [11] have recently been proposed. Initially, interest in SED was focused more on monophonic SED.…”
Section: Introduction
confidence: 99%
“…This approach integrates the strengths of both CNNs and RNNs, which have shown excellent performance in acoustic pattern recognition applications [4], [8], [9], [10], while overcoming their individual weaknesses. We evaluate the proposed method on three datasets of real-life recordings and compare its performance to FNN, CNN, RNN and GMM baselines.…”
Section: Introduction
confidence: 99%
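As a rough illustration of the CNN+RNN combination mentioned in this statement, the sketch below stacks a small convolutional front-end, a GRU layer for longer temporal context, and frame-wise sigmoid outputs; all layer sizes are assumptions rather than the cited authors' settings.

```python
# Hedged sketch of a CRNN for polyphonic SED: convolutions extract local
# spectro-temporal features, a GRU models longer temporal context, and
# frame-wise sigmoid outputs give per-class activities. Sizes are illustrative.
import torch
import torch.nn as nn

class SimpleCRNN(nn.Module):
    def __init__(self, n_bands=40, n_classes=6, n_filters=32, rnn_units=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, n_filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # pool over frequency, keep full time resolution
        )
        self.gru = nn.GRU(n_filters * (n_bands // 2), rnn_units, batch_first=True)
        self.out = nn.Linear(rnn_units, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_bands, n_frames)
        h = self.conv(x)                                  # (batch, filters, bands/2, frames)
        h = h.permute(0, 3, 1, 2).flatten(start_dim=2)    # (batch, frames, filters * bands/2)
        h, _ = self.gru(h)
        return torch.sigmoid(self.out(h))                 # frame-wise class activities

y = SimpleCRNN()(torch.randn(2, 1, 40, 100))   # -> (2, 100, 6)
```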
“…Or, for a spectrogram front-end, vertical filters are used to learn spectral representations [26], or horizontal filters to learn longer temporal cues [46]. Generally, a single filter shape is used in the first CNN layer [6,9,26,46], but some recent work has reported performance gains when using several filter shapes in the first layer [5,34,36,38,39,53]. Using many filters promotes richer feature extraction in the first layer and facilitates leveraging domain knowledge when designing the filters' shapes.…”
Section: Architectures
confidence: 99%
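To make the multi-shape first layer concrete, here is a hedged sketch with one "vertical" (spectral) and one "horizontal" (temporal) convolution branch over a spectrogram, whose pooled outputs are concatenated; the kernel shapes and sizes are assumptions for illustration only.

```python
# Sketch of a first CNN layer with several filter shapes: a vertical branch
# spanning many frequency bins (spectral cues) and a horizontal branch
# spanning many time frames (temporal cues), pooled and concatenated.
import torch
import torch.nn as nn

class MultiShapeFrontEnd(nn.Module):
    def __init__(self, n_filters=32):
        super().__init__()
        self.vertical = nn.Conv2d(1, n_filters, kernel_size=(32, 3), padding=(0, 1))
        self.horizontal = nn.Conv2d(1, n_filters, kernel_size=(3, 32), padding=(1, 0))

    def forward(self, spec):
        # spec: (batch, 1, n_bands, n_frames)
        v = torch.relu(self.vertical(spec))     # spectrally wide filters
        h = torch.relu(self.horizontal(spec))   # temporally wide filters
        # Global max pooling over the remaining frequency/time axes, then concatenate.
        v = torch.amax(v, dim=(2, 3))
        h = torch.amax(h, dim=(2, 3))
        return torch.cat([v, h], dim=1)         # (batch, 2 * n_filters)

feats = MultiShapeFrontEnd()(torch.randn(4, 1, 64, 128))
```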