Interspeech 2017
DOI: 10.21437/interspeech.2017-831

Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification

Abstract: In this paper, we propose to use a Convolutional Restricted Boltzmann Machine (ConvRBM) to learn a filterbank from raw audio signals. ConvRBM is a generative model trained in an unsupervised way to model audio signals of arbitrary lengths. ConvRBM is trained using the annealed dropout technique, and its parameters are optimized using Adam optimization. The subband filters of ConvRBM learned from the ESC-50 database resemble the Fourier basis in the mid-frequency range, while some of the low-frequency subband filters res…
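The abstract compresses the method into a few sentences; the sketch below unpacks the structure. It is a minimal illustration, not the authors' implementation: it assumes Gaussian visible units, binary sigmoid hidden units, and one-step contrastive divergence (CD-1) trained through the free-energy gradient, and it omits the annealed dropout the paper uses. Names such as ConvRBM, n_filters, and kernel_size are ours.

```python
# Minimal ConvRBM sketch -- an illustration of the idea in the abstract,
# not the authors' code. Assumptions (ours): Gaussian visible units,
# binary sigmoid hidden units, CD-1 via the free-energy gradient.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRBM(nn.Module):
    def __init__(self, n_filters=40, kernel_size=128):
        super().__init__()
        # Each row of W is one time-domain subband filter, shared by
        # convolution across the whole (arbitrary-length) signal.
        self.W = nn.Parameter(0.01 * torch.randn(n_filters, 1, kernel_size))
        self.b_h = nn.Parameter(torch.zeros(n_filters))  # hidden biases
        self.b_v = nn.Parameter(torch.zeros(1))          # visible bias

    def free_energy(self, v):
        # F(v) = 0.5*||v - b_v||^2 - sum_j softplus((W * v)_j + b_h)
        act = F.conv1d(v, self.W) + self.b_h[None, :, None]
        return (0.5 * (v - self.b_v).pow(2).sum(dim=(1, 2))
                - F.softplus(act).sum(dim=(1, 2)))

    def gibbs_step(self, v):
        # One block-Gibbs step: v -> sample h -> reconstruct v'.
        p_h = torch.sigmoid(F.conv1d(v, self.W) + self.b_h[None, :, None])
        return F.conv_transpose1d(torch.bernoulli(p_h), self.W) + self.b_v

def cd1_loss(rbm, v0):
    # CD-1: lower the free energy of data, raise it for reconstructions.
    v1 = rbm.gibbs_step(v0).detach()
    return (rbm.free_energy(v0) - rbm.free_energy(v1)).mean()

rbm = ConvRBM()
opt = torch.optim.Adam(rbm.parameters(), lr=1e-3)  # Adam, as in the paper
for _ in range(10):
    audio = torch.randn(8, 1, 4000)  # stand-in for raw ESC-50 waveforms
    opt.zero_grad()
    cd1_loss(rbm, audio).backward()
    opt.step()
```

After training, each row of W plays the role of one learned subband filter; plotting those rows is how one would check the abstract's observation that the learned filters resemble the Fourier basis in the mid-frequency range.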

Cited by 72 publications (31 citation statements).
References 27 publications.
“… [flattened results-table fragment; recoverable accuracy values (%): 96.7 (method label truncated), [39] 88.50, [46] 84.90, [35] 86.50, [36] 83.50, [38] 83.50, [37] (value truncated) …”
Section: Results (mentioning)
Confidence: 99%
“…ESC-50 also contains unlabeled samples for unsupervised learning. Sailor et al. [35] have applied deep CNNs to this classification task and have achieved results superior to human classification. Li et al. [36] proposed a neural network which took raw audio signals, spectrograms, and audio features as input, combined with an attention mechanism that was trained end-to-end with the rest of the network.…”
Section: Introduction (mentioning)
Confidence: 99%
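The Li et al. [36] description above is high-level, so the following is only a hedged sketch of a multi-input network with attention-weighted fusion over the three views. Every name, dimension, and the fusion rule here is our assumption, not the architecture from [36].

```python
# Hedged sketch: multi-view input with attention fusion, in the spirit
# of the Li et al. [36] description. All details are illustrative.
import torch
import torch.nn as nn

class MultiViewAttentionNet(nn.Module):
    def __init__(self, view_dims, embed_dim=64, n_classes=50):
        super().__init__()
        # One small encoder per input view (e.g. raw audio, spectrogram,
        # hand-crafted audio features), each mapped to a shared space.
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, embed_dim), nn.ReLU())
            for d in view_dims)
        self.attn = nn.Linear(embed_dim, 1)  # scores each view
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, views):
        # views: list of tensors, one per input representation
        z = torch.stack([enc(x) for enc, x in zip(self.encoders, views)],
                        dim=1)                  # (batch, n_views, embed)
        w = torch.softmax(self.attn(z), dim=1)  # attention over views
        fused = (w * z).sum(dim=1)              # weighted fusion
        return self.head(fused)                 # class logits

# Trained end-to-end: encoders, attention, and classifier share one loss.
net = MultiViewAttentionNet(view_dims=[4000, 1024, 60])
views = [torch.randn(2, 4000), torch.randn(2, 1024), torch.randn(2, 60)]
logits = net(views)  # shape (2, 50), one score per ESC-50 class
```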
“…In addition, TS-CNN10-concat and ST-CNN10-concat apply the attention mechanisms of temporal-spectral concatenation in Figure 2(a) and spectral-temporal concatenation in Figure 2(b), respectively.

[flattened results table, reconstructed; the three accuracy columns are most likely ESC-10, ESC-50, and UrbanSound8K, consistent with the quoted Human accuracies of 95.7% and 81.3% on ESC-10 and ESC-50]

Model                      ESC-10   ESC-50   US8K
…[19]                      80.5%    64.9%    73.0%
EnvNet-v2 [29]             91.3%    84.9%    78.3%
SB-CNN [30]                91.7%    83.9%    83.7%
GTSC+TEO-GTSC [31]         -        81.9%    88.0%
ConvRBM+FBEs [32]          -        86.5%    -
ACRNN [12]                 93.7%    86.1%    -
Multi-Stream CNN [33]      94.2%    84.0%    -
MelFB+LGTFB-EN-CNN [34]    93.7%    88.1%    85.8%
Human [3]                  95.7%    81.3%    -
CNN10 [23]                 92…

[trailing fragment, apparently from a second table: [21] 59.7% 62.5%, CNN10 [23] 68.1% …]”
Section: Network Structures (mentioning)
Confidence: 99%
“…The moment parameters of Adam optimization were chosen to be β1 = 0.5 and β2 = 0.999. The annealed dropout probability was chosen to be 0.3 based on our earlier experiments in ASR [20] and environmental sound classification [30]. After the model was trained, the features were extracted from the speech signal as discussed in Section 2.2.…”
Section: Training of ConvRBM and Feature Extraction (mentioning)
Confidence: 99%
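The quoted hyperparameters are concrete enough to illustrate. The sketch below is ours, not the authors' code: Adam uses the quoted moments β1 = 0.5 and β2 = 0.999, and the dropout probability is annealed linearly from the quoted 0.3 toward 0 over training; the linear schedule is an assumption, since the quote does not specify one.

```python
# Sketch of the quoted training configuration (ours, not the authors').
# Adam moments beta1=0.5, beta2=0.999 and dropout annealed from p=0.3;
# the linear schedule below is an assumption.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Conv1d(1, 40, 128), torch.nn.ReLU())
opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.5, 0.999))

p0, total_steps = 0.3, 1000
for step in range(total_steps):
    p_t = p0 * (1.0 - step / total_steps)  # anneal 0.3 -> 0 over training
    x = torch.randn(8, 1, 4000)            # stand-in raw-audio batch
    h = F.dropout(model(x), p=p_t, training=True)
    loss = h.pow(2).mean()                 # placeholder objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```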