2020
DOI: 10.1109/access.2020.3002761
Acoustic Scene Classification With Squeeze-Excitation Residual Networks

Abstract: Acoustic scene classification (ASC) is a problem in the field of machine listening whose objective is to classify or tag an audio clip with a predefined label describing a scene location (e.g. park, airport, etc.). Many state-of-the-art solutions to ASC incorporate data augmentation techniques and model ensembles. However, considerable improvements can also be achieved only by modifying the architecture of convolutional neural networks (CNNs). In this work we propose two novel squeeze-excitation blocks to …
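For context, the sketch below shows the standard squeeze-and-excitation recalibration that such blocks build on. It follows the original SE formulation rather than the two novel blocks proposed in the paper; the module name and the reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Standard squeeze-and-excitation recalibration (illustrative sketch,
    not the paper's proposed blocks)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                      # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                           # recalibrate feature maps channel-wise
```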

Cited by 36 publications (20 citation statements) · References 15 publications
“…We validate the strong performance of ResNet28, as well as improve it using SE [24]. The latter has been used before in acoustic scene classification [28], albeit on a much smaller, 3-CNN-layer VGG-style [28] architecture, comparable to the lower performers of our study.…”
Section: Related Work (supporting)
confidence: 72%
“…We replicated them exactly based on the papers in which they were originally introduced, with two exceptions: i) we consistently observed a drop in performance when batch normalisation was used, even though we use double the batch size of [23], so we do not use it anywhere; ii) ResNet28, which is slightly adapted from the ResNet38 proposed in [23] and described in Sub-section 4.2. We consider the following core models in the comparison: a) wavCRNN [36], a stack of 1-dimensional CNN layers applied to the raw audio waveform followed by a stack of recurrent neural network (RNN) layers, used in MIL-based categorical emotion classification from speech; b) melCRNN [2], a stack of 2-dimensional CNN layers followed by a stack of RNN layers, used in MIL-based classification of Bornean gibbon calls; c) CNN-3 [17], a simple model using CNN layers followed by average pooling, used in MIL-based audio tagging and of similar complexity to the model used in [28]; d) VGG16 [37]; e) CNN-14 [23]; f) ResNet28 [23]; and g) our proposed improvement, SE-ResNet28. Table 3 summarises the results.…”
Section: Results - Core Model Comparison (mentioning)
confidence: 99%
“…The convolutional network trained with the audio information is composed of blocks defined as Conv-StandardPOST. These blocks were proposed in [9]. The aim of these blocks is to achieve improved accuracy by recalibrating the internal feature maps using residual [10] and squeeze-excitation techniques [11,12].…”
Section: Convolutional Neural Network (mentioning)
confidence: 99%
“…The aim of these blocks is to achieve improved accuracy by recalibrating the internal feature maps using residual [10] and squeeze-excitation techniques [11,12]. For more insight into this choice, please see [9], where Conv-StandardPOST is fully explained and compared to other competing blocks. The architecture of the network can be seen in Table 1.…”
Section: Convolutional Neural Network (mentioning)
confidence: 99%
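As an illustration of the idea these statements describe, the sketch below inserts a squeeze-excitation recalibration into a plain residual convolutional block. It is a generic combination of residual and SE techniques, not the actual Conv-StandardPOST block of [9]; the layer sizes and the placement of the recalibration are assumptions.

```python
import torch
import torch.nn as nn

class ResidualSEBlock(nn.Module):
    """Generic residual block with squeeze-excitation recalibration
    (illustrative; not the Conv-StandardPOST block of [9])."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # SE recalibration applied to the residual branch before the skip addition.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        w = self.se(y).view(y.size(0), y.size(1), 1, 1)
        return self.relu(x + y * w)   # skip connection plus channel-wise recalibration
```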
“…Although audio signals are natively one-dimensional sequences, most state-of-the-art approaches to audio classification based on CNNs use a two-dimensional (2D) input [12,13]. Usually, these 2D inputs computed from the audio signal are well-known time-frequency representations such as Mel-spectrograms [14,15,16,17] or the output of constant-Q transform [18] (CQT) filterbanks, among others. Time-frequency 2D audio representations are able to accurately extract acoustically meaningful patterns but require a set of parameters to be specified, such as the window type and length, hop size or the number of frequency bins.…”
Section: Introduction (mentioning)
confidence: 99%
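To make the parameter dependence mentioned in that statement concrete, here is a short sketch of computing the two time-frequency representations it names using librosa. The file name is hypothetical, and the window length, hop size and bin counts are arbitrary illustrative choices, not values from the cited works.

```python
import numpy as np
import librosa

# Load a mono audio clip at its native sample rate ("scene_clip.wav" is a hypothetical path).
y, sr = librosa.load("scene_clip.wav", sr=None)

# Mel-spectrogram: window type/length, hop size and number of Mel bands must all be chosen.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, window="hann", n_mels=128
)
log_mel = librosa.power_to_db(mel)   # log-compressed magnitudes, the usual 2D CNN input

# Constant-Q transform: geometrically spaced bins, controlled by n_bins and bins_per_octave.
cqt = np.abs(librosa.cqt(y=y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))

print(log_mel.shape, cqt.shape)      # (n_mels, frames) and (n_bins, frames)
```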