2017
DOI: 10.1109/taslp.2017.2690575

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

Abstract: Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher-level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer-term temporal context in audio signals. CNNs and RNNs as classifiers have recently shown improved performance over established methods in various sound …

Cited by 476 publications (386 citation statements)
References 36 publications
“…These features have been shown to perform well in various audio tagging and sound event detection tasks [12,13,9]. First, we obtained the magnitude spectrum of the audio signals using the short-time Fourier transform (STFT) over 40 ms audio frames with 50% overlap, windowed with a Hamming window.…”
Section: Features
confidence: 99%
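The feature extraction described in this snippet can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and default parameters are assumptions, and only the frame length (40 ms), 50% overlap, Hamming window, and magnitude spectrum come from the quoted text.

```python
# Sketch of the described STFT front end: 40 ms Hamming-windowed
# frames with 50% overlap, reduced to the magnitude spectrum.
import numpy as np

def stft_magnitude(signal, sample_rate=16000, frame_ms=40, overlap=0.5):
    """Return the magnitude spectrogram, shape (n_frames, n_bins)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 40 ms in samples
    hop = int(frame_len * (1 - overlap))            # 50% overlap
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrum

# 1 s of noise at 16 kHz -> 49 frames of 321 frequency bins each
spec = stft_magnitude(np.random.randn(16000))
```

In practice such magnitude spectra are usually mapped further onto a mel filterbank before being fed to the network, but that step is outside the quoted passage.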
“…In this work, we combine these two approaches in a convolutional recurrent neural network (CRNN) and apply it over spectral acoustic features for the BAD challenge. This method consists of a slight modification (temporal max-pooling to obtain file-level estimates instead of frame-level estimates) and hyperparameter fine-tuning for the challenge over the CRNN proposed in [9], which has provided state-of-the-art results on various polyphonic sound event detection and audio tagging tasks. Similar approaches combining CNNs and RNNs have been presented recently in ASR [10] and music classification [11].…”
Section: Introduction
confidence: 99%
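The temporal max-pooling modification mentioned in this snippet reduces a sequence of frame-level class probabilities to a single file-level probability per class. A minimal numpy illustration of that reduction (the function name and example values are invented for illustration):

```python
# Temporal max-pooling over frame-level outputs: the file-level
# probability for each class is the maximum over all frames.
import numpy as np

def file_level_from_frames(frame_probs):
    """frame_probs: (n_frames, n_classes) -> (n_classes,) file-level probs."""
    return frame_probs.max(axis=0)

frame_probs = np.array([[0.1, 0.8],
                        [0.3, 0.2],
                        [0.9, 0.1]])
file_probs = file_level_from_frames(frame_probs)  # -> [0.9, 0.8]
```

The intuition is that a file contains an event class if the network is confident about it in at least one frame, which suits a file-level (audio tagging) task such as the BAD challenge.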
“…SED can be applied to many areas related to machine listening, such as traffic monitoring [1], smart meeting rooms [2], automated driving assistance [3], and multimedia analysis [4]. The popular classifiers for SED include deep models such as CRNNs [5,6], recurrent neural networks (RNNs) [7,8], and convolutional neural networks (CNNs) [9,10], as well as traditional shallow models such as random regression forests [11], support vector machines [12–14], hidden Markov models [15], and Gaussian mixture models [16].…”
Section: Introduction
confidence: 99%
“…Currently, due to the excellent performance of neural networks, most SED systems are based on them [1,2]. In particular, Phan et al. [3] and Xia et al. [4] propose methods for SED that combine multi-task learning and neural networks, in which audio tagging and event boundary detection are treated as two separate tasks.…”
Section: Introduction
confidence: 99%
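The multi-task setup described here, with audio tagging and event boundary detection as separate tasks, typically means two output heads on a shared encoder. The sketch below is a generic assumption of that pattern, not the architecture of the cited works; all names, shapes, and the random weights are hypothetical.

```python
# Hypothetical two-head multi-task sketch: a shared encoder feeds a
# clip-level tagging head and a frame-level boundary-detection head.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def shared_encoder(frames, w):
    return np.tanh(frames @ w)              # shared representation

def tagging_head(h, w):
    return sigmoid(h.mean(axis=0) @ w)      # one probability per class

def boundary_head(h, w):
    return sigmoid(h @ w)                   # per-frame event activity

frames = rng.standard_normal((100, 64))     # 100 frames, 64 features
w_enc = rng.standard_normal((64, 32))
w_tag = rng.standard_normal((32, 10))       # 10 event classes
w_bnd = rng.standard_normal((32, 10))

h = shared_encoder(frames, w_enc)
tags = tagging_head(h, w_tag)               # shape (10,)
boundaries = boundary_head(h, w_bnd)        # shape (100, 10)
```

Training would sum a loss per head so the shared encoder learns features useful for both tasks.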