2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7952190

Very deep convolutional neural networks for raw waveforms

Abstract: Learning acoustic models directly from the raw waveform data with minimal processing is challenging. Current waveform-based models have generally used very few (∼2) convolutional layers, which might be insufficient for building high-level discriminative features. In this work, we propose very deep convolutional neural networks (CNNs) that directly use time-domain waveforms as inputs. Our CNNs, with up to 34 weight layers, are efficient to optimize over very long sequences (e.g., vectors of size 32000), necessar…
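As a rough illustration of the abstract's idea, the sketch below (plain NumPy, single channel, random weights — the paper's actual model uses many channels per layer plus normalization, which are omitted here) stacks a large-filter strided first layer and several small-filter downsampling layers over a 32000-sample waveform:

```python
import numpy as np

def conv1d(x, w, stride):
    """Valid 1-D convolution of signal x with filter w at the given stride."""
    n = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(w)], w) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(32000)  # raw waveform, length as in the abstract's example

# First layer: a large filter with stride to aggressively downsample the waveform
x = np.maximum(conv1d(x, rng.standard_normal(80), stride=4), 0.0)

# Deeper layers: small filters; downsampling sketched here as stride-2 convolutions
for _ in range(4):
    x = np.maximum(conv1d(x, rng.standard_normal(3), stride=2), 0.0)

print(len(x))  # 497 time steps remain after the stack
```

The filter lengths and strides above are illustrative choices, not the paper's exact configuration; the point is that stacked strided convolutions let a deep network reduce a very long raw input to a short high-level feature sequence.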


Cited by 264 publications (257 citation statements). References 19 publications.
“…The majority of them used large filters in the first convolutional layer, with various stride sizes, to capture frequency-selective responses carefully designed for their target problems. We term this approach the frame-level raw waveform model because the filter and stride sizes of the first convolutional layer are chosen to be comparable to the window and hop sizes of the short-time Fourier transform, respectively [5][6][7][8][9][10][11].…”
Section: Related Work
confidence: 99%
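The window/hop correspondence described in the citation above can be sketched numerically. The 400-sample window and 160-sample hop below are hypothetical values (25 ms / 10 ms at 16 kHz), not taken from the cited works:

```python
import numpy as np

n_samples, win, hop = 16000, 400, 160  # hypothetical: 1 s of 16 kHz audio

# Number of analysis frames an STFT with this window/hop would produce
stft_frames = (n_samples - win) // hop + 1

# A first convolutional layer with filter length == win and stride == hop
# produces exactly the same number of output steps, which is why this is
# called a "frame-level" raw waveform model.
x = np.random.default_rng(1).standard_normal(n_samples)
w = np.random.default_rng(2).standard_normal(win)
conv_out = np.array([np.dot(x[i * hop : i * hop + win], w) for i in range(stft_frames)])

print(stft_frames, len(conv_out))  # 98 98
```

Each learned filter then plays a role analogous to one frequency bin of the Fourier analysis, but its response is optimized for the target task rather than fixed.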
See 1 more Smart Citation
“…The majority of them used large-sized filters in the first convolutional layer with various size of strides to capture frequency-selective responses which were carefully designed to handle their target problems. We termed this approach as frame-level raw waveform model because the filter and stride sizes of the first convolutional layer were chosen to be comparable to the window and the hop sizes of short-time Fourier transformation, respectively [5][6][7][8][9][10][11].…”
Section: Related Workmentioning
confidence: 99%
“…These spectral representations have served a role similar to that of word embeddings in language models, in that the mid-level representations are computed separately from the learning model and are not particularly optimized for the target task. This issue has been addressed by taking raw waveforms directly as input in different audio tasks, for example, speech recognition [5][6][7], music classification [8][9][10], and acoustic scene classification [11,12].…”
Section: Introduction
confidence: 99%
“…The CEMD architecture is graphically depicted in Figure . The encoding step in CEMD is a convolutional neural network (CNN) composed of sequential layers of convolution, pooling, and rectification. Each of these layers has parameterized weights, which can collectively be referred to as $W$; therefore, the CNN with weights $W$ can be conceived of as applying a series of transformations that perform the encoding function $f: \mathbb{R}^{512} \to \mathbb{R}^{k}$ defined by $f(s, W) = \theta$.…”
Section: Methods
confidence: 99%
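A minimal sketch of such an encoder $f: \mathbb{R}^{512} \to \mathbb{R}^{k}$, assuming hypothetical layer sizes (one convolution with filter length 8, max-pool width 4, $k = 16$) and random weights — the cited CEMD model's actual layer configuration is not given in this excerpt:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode(s, W1, W2, pool=4):
    """Sketch of f(s, W) = theta as convolution -> pooling -> rectification -> linear map."""
    # convolution: filter length 8, stride 1, valid padding -> length 505
    conv = np.array([np.dot(s[i : i + 8], W1) for i in range(len(s) - 7)])
    # non-overlapping max-pooling with windows of size `pool`
    n = len(conv) // pool
    pooled = conv[: n * pool].reshape(n, pool).max(axis=1)
    # rectification, then a linear map down to the k-dimensional code theta
    return W2 @ relu(pooled)

rng = np.random.default_rng(0)
s = rng.standard_normal(512)                    # input signal in R^512
W1 = rng.standard_normal(8)                     # conv filter (hypothetical size)
k = 16
W2 = rng.standard_normal((k, (512 - 7) // 4))   # linear layer mapping to R^k
theta = encode(s, W1, W2)
print(theta.shape)                              # (16,)
```

All weights here stand in for the trained parameters $W$; in the cited work they would be learned jointly with the rest of the model.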
“…Several techniques for feature extraction based on image-processing methods, which treat the time-frequency representation of sound signals as images, have also been proposed. In addition, because deep learning has great potential for handling high-dimensional data and for jointly optimizing feature extraction and statistical models in end-to-end approaches, techniques that directly use the waveform amplitude values of sound signals as input have also been developed.…”
Section: Back-end Techniques For Environmental Sound Processing
confidence: 99%