2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7952190

Very deep convolutional neural networks for raw waveforms

Abstract: Learning acoustic models directly from the raw waveform data with minimal processing is challenging. Current waveform-based models have generally used very few (∼2) convolutional layers, which might be insufficient for building high-level discriminative features. In this work, we propose very deep convolutional neural networks (CNNs) that directly use time-domain waveforms as inputs. Our CNNs, with up to 34 weight layers, are efficient to optimize over very long sequences (e.g., vectors of size 32000), necessar…
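As a rough illustration of the abstract's idea, the sketch below (plain NumPy, single channel, random weights — the paper's actual model uses many channels per layer plus normalization, which are omitted here) stacks a large-filter strided first layer and several small-filter downsampling layers over a 32000-sample waveform:

```python
import numpy as np

def conv1d(x, w, stride):
    """Valid 1-D convolution of signal x with filter w at the given stride."""
    n = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(w)], w) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(32000)  # raw waveform, length as in the abstract's example

# First layer: a large filter with stride to aggressively downsample the waveform
x = np.maximum(conv1d(x, rng.standard_normal(80), stride=4), 0.0)

# Deeper layers: small filters; downsampling sketched here as stride-2 convolutions
for _ in range(4):
    x = np.maximum(conv1d(x, rng.standard_normal(3), stride=2), 0.0)

print(len(x))  # 497 time steps remain after the stack
```

The filter lengths and strides above are illustrative choices, not the paper's exact configuration; the point is that stacked strided convolutions let a deep network reduce a very long raw input to a short high-level feature sequence.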


Cited by 264 publications (257 citation statements). References 19 publications.
“…The majority of them used large filters in the first convolutional layer, with various stride sizes, to capture frequency-selective responses carefully designed for their target problems. We term this approach the frame-level raw waveform model because the filter and stride sizes of the first convolutional layer are chosen to be comparable to the window and hop sizes of the short-time Fourier transform, respectively [5][6][7][8][9][10][11].…”
Section: Related Work
confidence: 99%
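The window/hop correspondence described in the citation above can be sketched numerically. The 400-sample window and 160-sample hop below are hypothetical values (25 ms / 10 ms at 16 kHz), not taken from the cited works:

```python
import numpy as np

n_samples, win, hop = 16000, 400, 160  # hypothetical: 1 s of 16 kHz audio

# Number of analysis frames an STFT with this window/hop would produce
stft_frames = (n_samples - win) // hop + 1

# A first convolutional layer with filter length == win and stride == hop
# produces exactly the same number of output steps, which is why this is
# called a "frame-level" raw waveform model.
x = np.random.default_rng(1).standard_normal(n_samples)
w = np.random.default_rng(2).standard_normal(win)
conv_out = np.array([np.dot(x[i * hop : i * hop + win], w) for i in range(stft_frames)])

print(stft_frames, len(conv_out))  # 98 98
```

Each learned filter then plays a role analogous to one frequency bin of the Fourier analysis, but its response is optimized for the target task rather than fixed.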
See 1 more Smart Citation
“…The majority of them used large-sized filters in the first convolutional layer with various size of strides to capture frequency-selective responses which were carefully designed to handle their target problems. We termed this approach as frame-level raw waveform model because the filter and stride sizes of the first convolutional layer were chosen to be comparable to the window and the hop sizes of short-time Fourier transformation, respectively [5][6][7][8][9][10][11].…”
Section: Related Workmentioning
confidence: 99%
“…These spectral representations have served a role similar to that of word embeddings in language models, in that the mid-level representations are computed separately from the learning model and are not particularly optimized for the target task. This issue has been addressed by taking raw waveforms directly as input in different audio tasks, for example, speech recognition [5][6][7], music classification [8][9][10], and acoustic scene classification [11,12].…”
Section: Introduction
confidence: 99%
“…The CEMD architecture is graphically depicted in Figure . The encoding step in CEMD is a convolutional neural network (CNN) composed of sequential layers of convolution, pooling, and rectification. Each of these layers has parameterized weights, which can collectively be referred to as $W$; therefore, the CNN with weights $W$ can be conceived of as applying a series of transformations that perform the encoding function $f: \mathbb{R}^{512} \to \mathbb{R}^{k}$ defined by $f(s, W) = \theta$.…”
Section: Methods
confidence: 99%
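A minimal sketch of such an encoder $f: \mathbb{R}^{512} \to \mathbb{R}^{k}$, assuming hypothetical layer sizes (one convolution with filter length 8, max-pool width 4, $k = 16$) and random weights — the cited CEMD model's actual layer configuration is not given in this excerpt:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode(s, W1, W2, pool=4):
    """Sketch of f(s, W) = theta as convolution -> pooling -> rectification -> linear map."""
    # convolution: filter length 8, stride 1, valid padding -> length 505
    conv = np.array([np.dot(s[i : i + 8], W1) for i in range(len(s) - 7)])
    # non-overlapping max-pooling with windows of size `pool`
    n = len(conv) // pool
    pooled = conv[: n * pool].reshape(n, pool).max(axis=1)
    # rectification, then a linear map down to the k-dimensional code theta
    return W2 @ relu(pooled)

rng = np.random.default_rng(0)
s = rng.standard_normal(512)                    # input signal in R^512
W1 = rng.standard_normal(8)                     # conv filter (hypothetical size)
k = 16
W2 = rng.standard_normal((k, (512 - 7) // 4))   # linear layer mapping to R^k
theta = encode(s, W1, W2)
print(theta.shape)                              # (16,)
```

All weights here stand in for the trained parameters $W$; in the cited work they would be learned jointly with the rest of the model.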
“…Several techniques for feature extraction based on image-processing methods, which treat the time-frequency representation of sound signals as images, have also been proposed. In addition, because deep learning has great potential for handling high-dimensional data and for jointly optimizing feature extraction and statistical models in end-to-end approaches, techniques that directly use the waveform amplitude values of sound signals as input have also been developed.…”
Section: Back-end Techniques For Environmental Sound Processing
confidence: 99%