Audio Scene Classification with Deep Recurrent Neural Networks

Phan, Huy; Koch, Philipp; Katzberg, Fabrice; Maass, Marco; Mazur, Radoslaw; Mertins, Alfred

doi:10.21437/interspeech.2017-101

Cited by 45 publications

(58 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Compared to the standard softmax, Support Vector Machines (SVM) usually achieve better generalization due to their maximum margin property [26]. Similar to [6,2], after training the network, we calibrate the final classifier by employing a linear SVM in replacement for the softmax layer. The trained network is used to extract feature vectors for the original training examples (without data augmentation) which are used to train the SVM classifier.…”

Section: Calibration With Support Vector Machinementioning

confidence: 99%

“…Note that the classification label of a 30-second recording was derived via aggregation of the classification results of its 2-second segments. To this end, probabilistic multiplicative fusion, followed by likelihood maximization were carried out similar to [2].…”

Section: Baselinementioning

confidence: 99%

“…Combining the classification results of the Att-CRNN on both types of feature using the probabilistic multiplicative aggregation [2] (i.e. Att-CRNN (fusion) in Table 3) further enlarges the margin up to 0.66% absolute gain on the overall accuracy, or…”

Section: Performance Comparison With State-of-the-artmentioning

confidence: 99%

“…Convolutional neural network (CNNs) [6,7,8,5,4,9,10,11] are most commonly used deep learning techniques, thanks to their feature learning capability. Leveraging the nature of audio signals, sequential modelling with recurrent neural networks (RNNs) [2,12,13] and temporal transformer networks [10] have also demonstrated results on par with the convolutional counterparts. The deep learning models were trained either on time-frequency representations, such as log Mel-scale spectrograms [4,11], or highlevel features like label tree embedding features [6,2].…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Spatio-Temporal Attention Pooling for Audio Scene Classification

et al. 2019

Self Cite

View full text Add to dashboard Cite

Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors learned from two designated attention layers to weigh and pool the recurrent output into a final feature vector for classification. The network is trained with betweenclass examples generated from between-class data augmentation. Experiments demonstrate that the proposed method not only outperforms a strong convolutional neural network baseline but also sets new state-of-the-art performance on the LITIS Rouen dataset.

show abstract

Section: Calibration With Support Vector Machinementioning

confidence: 99%

Section: Baselinementioning

confidence: 99%

Section: Performance Comparison With State-of-the-artmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Spatio-Temporal Attention Pooling for Audio Scene Classification

et al. 2019

Self Cite

View full text Add to dashboard Cite

show abstract

“…A popular approach for acoustic scene recognition (ASR) and the tagging task is to use the low-level or high-level acoustic features such as Mel-frequency cepstral coefficients (MFCCs), Mel-spectrogram, Mel-bank, log Mel-bank features, etc., with the state-of-the-art deep models [6,7]. Some of these acoustic features possess complementary qualities, that is, for two given features, one is apt in identifying certain specific classes, while the other is suitable for the rest.…”

Section: Introductionmentioning

confidence: 99%

Acoustic features fusion using attentive multi-channel deep architecture

Bhatt¹,

Gupta²,

Arora³

et al. 2018

5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018)

View full text Add to dashboard Cite

In this paper, we present a novel deep fusion architecture for audio classification tasks. The multi-channel model presented is formed using deep convolution layers where different acoustic features are passed through each channel. To enable dissemination of information across the channels, we introduce attention feature maps that aid in the alignment of frames. The output of each channel is merged using interaction parameters that non-linearly aggregate the representative features. Finally, we evaluate the performance of the proposed architecture on three benchmark datasets :-DCASE-2016 and LITIS Rouen (acoustic scene recognition), and CHiME-Home (tagging). Our experimental results suggest that the architecture presented outperforms the standard baselines and achieves outstanding performance on the task of acoustic scene recognition and audio tagging.

show abstract

Development and Analysis of Deep Learning Architectures

2020

Studies in Computational Intelligence

View full text Add to dashboard Cite

In this chapter, we compare deep learning and classical approaches for detection of baby cry sounds in various domestic environments under challenging signal-to-noise ratio conditions. Automatic cry detection has applications in commercial products (such as baby remote monitors) as well as in medical and psycho-social research. We design and evaluate several convolutional neural network (CNN) architectures for baby cry detection, and compare their performance to that of classical machine-learning approaches, such as logistic regression and support vector machines. In addition to feedforward CNNs, we analyze the performance of recurrent neural network (RNN) architectures, which are able to capture temporal behavior of acoustic events. We show that by carefully designing CNN architectures with specialized nonsymmetric kernels, better results are obtained compared to common CNN architectures.

show abstract

Audio Scene Classification with Deep Recurrent Neural Networks

Cited by 45 publications

References 19 publications

Spatio-Temporal Attention Pooling for Audio Scene Classification

Spatio-Temporal Attention Pooling for Audio Scene Classification

Acoustic features fusion using attentive multi-channel deep architecture

Development and Analysis of Deep Learning Architectures

Contact Info

Product

Resources

About