Interspeech 2016
DOI: 10.21437/interspeech.2016-123

Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks

Abstract: We present in this paper a simple yet efficient convolutional neural network (CNN) architecture for robust audio event recognition. In contrast to deep CNN architectures with multiple convolutional and pooling layers topped with multiple fully connected layers, the proposed network consists of only three layers: a convolutional layer, a pooling layer, and a softmax layer. Two further features distinguish it from the deep architectures that have been proposed for the task: varying-size convolutional filters at the convolutional…
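The abstract describes a three-layer network: parallel convolutional filters of varying temporal width, a 1-max pooling stage, and a softmax output layer. Below is a minimal PyTorch sketch of that idea; the filter widths, filter count, and input dimensions are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a 1-max pooling CNN: parallel convolutions with varying temporal
# widths, global max pooling over time per filter, and a softmax classifier.
# All sizes below are assumptions for illustration.
import torch
import torch.nn as nn

class OneMaxPoolingCNN(nn.Module):
    def __init__(self, n_bands=40, n_classes=10,
                 filter_widths=(3, 5, 7), n_filters=100):
        super().__init__()
        # One Conv1d per filter width; frequency bands act as input channels,
        # so each filter spans all bands and a varying number of time frames.
        self.convs = nn.ModuleList(
            nn.Conv1d(n_bands, n_filters, kernel_size=w) for w in filter_widths
        )
        self.fc = nn.Linear(n_filters * len(filter_widths), n_classes)

    def forward(self, x):
        # x: (batch, n_bands, n_frames), e.g. a log-mel spectrogram
        feats = []
        for conv in self.convs:
            h = torch.relu(conv(x))      # (batch, n_filters, n_frames')
            h = torch.amax(h, dim=2)     # 1-max pooling: keep the strongest response over time
            feats.append(h)
        h = torch.cat(feats, dim=1)      # concatenate pooled features from all filter widths
        return self.fc(h)                # logits; apply softmax for class probabilities

logits = OneMaxPoolingCNN()(torch.randn(8, 40, 101))   # 8 clips, 40 bands, 101 frames
probs = torch.softmax(logits, dim=1)
```

The 1-max pooling over time is what makes the model tolerant to where an event occurs within a clip: only the strongest filter activation is kept, regardless of its temporal position.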

Cited by 81 publications (65 citation statements). References 15 publications.
“…2) Selecting SELDnet output format: The output format for polyphonic SED in the literature has become standardized to estimating the temporal activity of each sound class using frame-wise binary numbers [31][32][33][34]. On the other hand, the output formats for DOA estimation are still being experimented with as seen in Table I.…”
Section: Methods
confidence: 99%
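The statement above refers to the standard output format for polyphonic SED: frame-wise binary activity per sound class. As a purely illustrative sketch (not taken from the cited works), that format can be obtained by thresholding per-class sigmoid outputs:

```python
# Illustrative NumPy sketch of the frame-wise binary output format for
# polyphonic SED: per-class sigmoid activations thresholded into a
# (frames x classes) activity matrix. Values are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_classes = 6, 4
class_probs = rng.random((n_frames, n_classes))   # stand-in for network sigmoid outputs

activity = (class_probs > 0.5).astype(int)        # 1 = class active in that frame
print(activity)
# Each row is one time frame; several classes may be active at once (polyphony).
```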
“…Previous work on sound events has mostly focused on sound event classification, where audio clips consisting of sound events are classified. Apart from established classifiers, such as support vector machines [1], [3], deep learning methods such as deep belief networks [7], convolutional neural networks (CNN) [8], [9], [10] and recurrent neural networks (RNN) [4], [11] have recently been proposed. Initially, interest in SED was focused more on monophonic SED.…”
Section: Introduction
confidence: 99%
“…This approach integrates the strengths of both CNNs and RNNs, which have shown excellent performance in acoustic pattern recognition applications [4], [8], [9], [10], while overcoming their individual weaknesses. We evaluate the proposed method on three datasets of real-life recordings and compare its performance to FNN, CNN, RNN and GMM baselines.…”
Section: Introduction
confidence: 99%
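As a rough illustration of the CNN+RNN combination mentioned in this statement, the sketch below stacks a small convolutional front-end, a GRU layer for longer temporal context, and frame-wise sigmoid outputs; all layer sizes are assumptions rather than the cited authors' settings.

```python
# Hedged sketch of a CRNN for polyphonic SED: convolutions extract local
# spectro-temporal features, a GRU models longer temporal context, and
# frame-wise sigmoid outputs give per-class activities. Sizes are illustrative.
import torch
import torch.nn as nn

class SimpleCRNN(nn.Module):
    def __init__(self, n_bands=40, n_classes=6, n_filters=32, rnn_units=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, n_filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # pool over frequency, keep full time resolution
        )
        self.gru = nn.GRU(n_filters * (n_bands // 2), rnn_units, batch_first=True)
        self.out = nn.Linear(rnn_units, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_bands, n_frames)
        h = self.conv(x)                                  # (batch, filters, bands/2, frames)
        h = h.permute(0, 3, 1, 2).flatten(start_dim=2)    # (batch, frames, filters * bands/2)
        h, _ = self.gru(h)
        return torch.sigmoid(self.out(h))                 # frame-wise class activities

y = SimpleCRNN()(torch.randn(2, 1, 40, 100))   # -> (2, 100, 6)
```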
“…Or, for a spectrogram front-end, vertical filters are used to learn spectral representations [26], or horizontal filters to learn longer temporal cues [46]. Generally, a single filter shape is used in the first CNN layer [6,9,26,46], but some recent work has reported performance gains when using several filter shapes in the first layer [5,34,36,38,39,53]. Using many filters promotes richer feature extraction in the first layer and facilitates leveraging domain knowledge when designing the filters' shapes.…”
Section: Architectures
confidence: 99%
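To make the multi-shape first layer concrete, here is a hedged sketch with one "vertical" (spectral) and one "horizontal" (temporal) convolution branch over a spectrogram, whose pooled outputs are concatenated; the kernel shapes and sizes are assumptions for illustration only.

```python
# Sketch of a first CNN layer with several filter shapes: a vertical branch
# spanning many frequency bins (spectral cues) and a horizontal branch
# spanning many time frames (temporal cues), pooled and concatenated.
import torch
import torch.nn as nn

class MultiShapeFrontEnd(nn.Module):
    def __init__(self, n_filters=32):
        super().__init__()
        self.vertical = nn.Conv2d(1, n_filters, kernel_size=(32, 3), padding=(0, 1))
        self.horizontal = nn.Conv2d(1, n_filters, kernel_size=(3, 32), padding=(1, 0))

    def forward(self, spec):
        # spec: (batch, 1, n_bands, n_frames)
        v = torch.relu(self.vertical(spec))     # spectrally wide filters
        h = torch.relu(self.horizontal(spec))   # temporally wide filters
        # Global max pooling over the remaining frequency/time axes, then concatenate.
        v = torch.amax(v, dim=(2, 3))
        h = torch.amax(h, dim=(2, 3))
        return torch.cat([v, h], dim=1)         # (batch, 2 * n_filters)

feats = MultiShapeFrontEnd()(torch.randn(4, 1, 64, 128))
```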