Interspeech 2017
DOI: 10.21437/interspeech.2017-101

Audio Scene Classification with Deep Recurrent Neural Networks

Abstract: We introduce in this work an efficient approach for audio scene classification using deep recurrent neural networks. An audio scene is first transformed into a sequence of high-level label tree embedding feature vectors. The vector sequence is then divided into multiple subsequences on which a deep GRU-based recurrent neural network is trained for sequence-to-label classification. The global predicted label for the entire sequence is finally obtained via aggregation of subsequence classification outputs. We w…
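The abstract describes a pipeline of label tree embedding features, subsequence-level GRU classification, and aggregation of subsequence outputs. Below is a minimal PyTorch sketch of that pipeline, not the authors' code: the feature dimension, hidden size, number of classes, and the log-domain aggregation are assumptions chosen for illustration.

import torch
import torch.nn as nn

class SubsequenceGRUClassifier(nn.Module):
    """Deep GRU network for sequence-to-label classification of subsequences."""
    def __init__(self, feat_dim=15, hidden=256, num_layers=2, num_classes=15):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        out, _ = self.gru(x)               # out: (batch, time, hidden)
        return self.fc(out[:, -1, :])      # class logits from the last time step

def classify_recording(model, subsequences):
    """Aggregate subsequence posteriors into one label for the whole recording."""
    with torch.no_grad():
        logits = model(subsequences)                    # (num_subseq, num_classes)
        log_probs = torch.log_softmax(logits, dim=-1)
        # Multiplying posteriors = summing log-posteriors; argmax gives the label.
        return int(log_probs.sum(dim=0).argmax())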

Cited by 45 publications (58 citation statements) · References 19 publications

“…Compared to the standard softmax, Support Vector Machines (SVMs) usually achieve better generalization due to their maximum-margin property [26]. Similar to [6,2], after training the network, we calibrate the final classifier by employing a linear SVM in place of the softmax layer. The trained network is used to extract feature vectors for the original training examples (without data augmentation), which are then used to train the SVM classifier.…”
Section: Calibration with Support Vector Machine
confidence: 99%
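A minimal sketch of the SVM calibration step quoted above, assuming the hypothetical SubsequenceGRUClassifier from the earlier sketch and scikit-learn's LinearSVC; the penultimate-layer choice and hyperparameters are assumptions, not the cited papers' implementation.

import torch
from sklearn.svm import LinearSVC

def extract_features(model, inputs):
    # Run the trained network up to the layer that fed the (removed) softmax
    # and return fixed-length feature vectors for the SVM.
    model.eval()
    with torch.no_grad():
        out, _ = model.gru(inputs)         # (num_examples, time, hidden)
        return out[:, -1, :].numpy()

# X_train: original (non-augmented) training inputs, y_train: integer labels.
# svm = LinearSVC(C=1.0).fit(extract_features(model, X_train), y_train)
# predicted = svm.predict(extract_features(model, X_test))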
“…Note that the classification label of a 30-second recording was derived via aggregation of the classification results of its 2-second segments. To this end, probabilistic multiplicative fusion followed by likelihood maximization was carried out, similar to [2].…”
Section: Baseline
confidence: 99%
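The aggregation described in this statement can be illustrated with a small numeric sketch: multiply the per-class posteriors of all 2-second segments (summed in log space for stability) and pick the class that maximizes the fused likelihood. The posterior values below are made up for illustration.

import numpy as np

segment_posteriors = np.array([
    [0.6, 0.3, 0.1],   # segment 1: P(class | segment)
    [0.5, 0.4, 0.1],   # segment 2
    [0.7, 0.2, 0.1],   # segment 3
])
fused = np.log(segment_posteriors).sum(axis=0)   # probabilistic multiplicative fusion
recording_label = int(np.argmax(fused))          # likelihood maximization
print(recording_label)                           # -> 0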
“…A popular approach for acoustic scene recognition (ASR) and the tagging task is to use low-level or high-level acoustic features, such as Mel-frequency cepstral coefficients (MFCCs), Mel-spectrogram, Mel-bank, and log Mel-bank features, with state-of-the-art deep models [6,7]. Some of these acoustic features possess complementary qualities: for two given features, one is apt at identifying certain specific classes, while the other is suitable for the rest.…”
Section: Introduction
confidence: 99%
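As a small illustration of the low-level features named in this statement, the sketch below computes a log Mel-band spectrogram and MFCCs with librosa; the file name, sampling rate, and band/coefficient counts are assumptions.

import numpy as np
import librosa

y, sr = librosa.load("scene.wav", sr=44100)                  # placeholder file name
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)  # Mel-band energies
log_mel = librosa.power_to_db(mel, ref=np.max)               # log Mel-bank features
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)           # MFCCs
print(log_mel.shape, mfcc.shape)                             # (64, frames), (20, frames)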