2020
DOI: 10.1109/access.2020.3002761
Acoustic Scene Classification With Squeeze-Excitation Residual Networks

Abstract: Acoustic scene classification (ASC) is a problem in the field of machine listening whose objective is to classify or tag an audio clip with a predefined label describing a scene location (e.g. park, airport, etc.). Many state-of-the-art solutions to ASC incorporate data augmentation techniques and model ensembles. However, considerable improvements can also be achieved only by modifying the architecture of convolutional neural networks (CNNs). In this work we propose two novel squeeze-excitation blocks to …
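For context, the sketch below shows the standard squeeze-and-excitation recalibration that such blocks build on. It follows the original SE formulation rather than the two novel blocks proposed in the paper; the module name and the reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Standard squeeze-and-excitation recalibration (illustrative sketch,
    not the paper's proposed blocks)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                      # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                           # recalibrate feature maps channel-wise
```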

Cited by 36 publications (20 citation statements) · References 15 publications
“…We validate the strong performance of ResNet28, as well as improve it using SE [24]. The latter has been used before in acoustic scene classification [28], albeit on a much smaller, 3-CNN-layer VGG-style [28] architecture, comparable to the lower performers of our study.…”
Section: Related Work (supporting)
confidence: 72%
“…We replicated them exactly based on the papers in which they were originally introduced, with two exceptions: i) we consistently observed a drop in performance when batch normalisation was used, even though we use double the batch size of [23], so we do not use it anywhere; ii) ResNet28, which is slightly adapted from the ResNet38 proposed in [23] and described in Sub-section 4.2. We consider the following core models in the comparison: a) wavCRNN [36], a stack of 1-dimensional CNN layers applied to the raw audio waveform followed by a stack of recurrent neural network (RNN) layers, used in MIL-based categorical emotion classification from speech; b) melCRNN [2], a stack of 2-dimensional CNN layers followed by a stack of RNN layers, used in MIL-based classification of Bornean gibbon calls; c) CNN-3 [17], a simple model using CNN layers followed by average pooling, used in MIL-based audio tagging and of similar complexity to the model used in [28]; d) VGG16 [37]; e) CNN-14 [23]; f) ResNet28 [23]; and g) our proposed improvement, SE-ResNet28. Table 3 summarises the results.…”
Section: Results - Core Model Comparison (mentioning)
confidence: 99%
“…The convolutional network trained with the audio information is composed of blocks defined as Conv-StandardPOST. These blocks were proposed in [9]. The aim of these blocks is to achieve improved accuracy by recalibrating the internal feature maps using residual [10] and squeeze-excitation techniques [11,12].…”
Section: Convolutional Neural Network (mentioning)
confidence: 99%
“…The aim of these blocks is to achieve improved accuracy by recalibrating the internal feature maps using residual [10] and squeeze-excitation techniques [11,12]. For more insight into this choice, please see [9], where Conv-StandardPOST is fully explained and compared to other competing blocks. The architecture of the network can be seen in Table 1.…”
Section: Convolutional Neural Network (mentioning)
confidence: 99%
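As an illustration of the idea these statements describe, the sketch below inserts a squeeze-excitation recalibration into a plain residual convolutional block. It is a generic combination of residual and SE techniques, not the actual Conv-StandardPOST block of [9]; the layer sizes and the placement of the recalibration are assumptions.

```python
import torch
import torch.nn as nn

class ResidualSEBlock(nn.Module):
    """Generic residual block with squeeze-excitation recalibration
    (illustrative; not the Conv-StandardPOST block of [9])."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # SE recalibration applied to the residual branch before the skip addition.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        w = self.se(y).view(y.size(0), y.size(1), 1, 1)
        return self.relu(x + y * w)   # skip connection plus channel-wise recalibration
```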
“…Although audio signals are natively one-dimensional sequences, most state-of-the-art approaches to audio classification based on CNNs use a two-dimensional (2D) input [12,13]. Usually, these 2D inputs computed from the audio signal are well-known time-frequency representations such as Mel-spectrograms [14,15,16,17] or the output of constant-Q transform [18] (CQT) filterbanks, among others. Time-frequency 2D audio representations are able to accurately extract acoustically meaningful patterns but require a set of parameters to be specified, such as the window type and length, hop size or the number of frequency bins.…”
Section: Introduction (mentioning)
confidence: 99%
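To make the parameter dependence mentioned in that statement concrete, here is a short sketch of computing the two time-frequency representations it names using librosa. The file name is hypothetical, and the window length, hop size and bin counts are arbitrary illustrative choices, not values from the cited works.

```python
import numpy as np
import librosa

# Load a mono audio clip at its native sample rate ("scene_clip.wav" is a hypothetical path).
y, sr = librosa.load("scene_clip.wav", sr=None)

# Mel-spectrogram: window type/length, hop size and number of Mel bands must all be chosen.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, window="hann", n_mels=128
)
log_mel = librosa.power_to_db(mel)   # log-compressed magnitudes, the usual 2D CNN input

# Constant-Q transform: geometrically spaced bins, controlled by n_bins and bins_per_octave.
cqt = np.abs(librosa.cqt(y=y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))

print(log_mel.shape, cqt.shape)      # (n_mels, frames) and (n_bins, frames)
```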