2018 26th European Signal Processing Conference (EUSIPCO)
DOI: 10.23919/eusipco.2018.8553106

A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging

Abstract: In this paper, we empirically investigate the effect of audio preprocessing on music tagging with deep neural networks. We perform comprehensive experiments involving audio preprocessing using different time-frequency representations, logarithmic magnitude compression, frequency weighting, and scaling. We show that many commonly used input preprocessing techniques are redundant, with the exception of magnitude compression.
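As a rough illustration of the preprocessing variants the abstract lists, the sketch below computes a time-frequency representation and applies log compression, frequency weighting, and scaling. It is a minimal sketch only: the file name, STFT parameters, and the specific weighting/scaling formulas are assumptions, not the paper's exact experimental setup.

```python
import numpy as np
import librosa

# Placeholder input; the file name and STFT parameters are assumptions.
y, sr = librosa.load("example.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))  # time-frequency representation

# Logarithmic magnitude compression -- the one step the paper finds is not redundant.
S_log = np.log(S + 1e-7)

# Schematic frequency weighting (here A-weighting) and simple max scaling.
w_db = librosa.A_weighting(librosa.fft_frequencies(sr=sr, n_fft=1024))
S_weighted = S * (10.0 ** (w_db / 20.0))[:, np.newaxis]  # dB gains -> linear
S_scaled = S / (S.max() + 1e-7)
```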

Cited by 50 publications (27 citation statements)
References 14 publications

“…For this purpose, we use the dB-scale mel-scale magnitude spectrum of an input audio fragment, extracted by applying 128-band mel-filter banks to the Short-Time Fourier Transform (STFT). Mel-spectrograms have generally been a popular input representation choice for CNNs applied to music-related tasks [16,17,20,26,41,64]; besides, it was also recently reported that their frequency-domain summarization, based on psycho-acoustics, is efficient and not easily learnable through data-driven approaches [65,66]. We choose a 1024-sample window size and a 256-sample hop size, translating to about 46 ms and 11.6 ms respectively at a sampling rate of 22 kHz.…”
Section: Audio Preprocessing (mentioning, confidence: 99%)
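The excerpt fully specifies this front end, so it can be reproduced nearly verbatim. A minimal sketch using librosa, with the input file name as the only placeholder:

```python
import numpy as np
import librosa

y, sr = librosa.load("track.wav", sr=22050)  # placeholder file; 22 kHz as in the excerpt

# 128-band mel spectrogram over an STFT with a 1024-sample window (~46 ms)
# and a 256-sample hop (~11.6 ms), as described above.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128)

# dB-scale magnitude, i.e. logarithmic compression of the mel spectrogram.
S_db = librosa.power_to_db(S, ref=np.max)
print(S_db.shape)  # (128, n_frames)
```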
“…The filterbank maps the FFT bins to 64 bins and thus reduces the data to be processed and stored in a later stage of the signal processing pipeline. The subsequent log compression creates a distribution of values that is more suitable for the convolutional neural network [11]. With an input segment size of 12.8 seconds, the size of the time-frequency representation is Time x Frequency x Channels (T x F x C) = 24 x 64 x 1.…”
Section: Pre-processing (mentioning, confidence: 99%)
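A possible rendering of this pipeline is sketched below. The excerpt fixes the 64-band filterbank, the log compression, the 12.8 s segment, and the 24x64x1 output, but not the sample rate or framing; the 16 kHz rate and the 24 non-overlapping frames are assumptions chosen so the stated shape falls out.

```python
import numpy as np
import librosa

sr = 16000                                                 # assumed; not stated in the excerpt
seg = np.random.randn(int(12.8 * sr)).astype(np.float32)   # placeholder 12.8 s segment

# Split the segment into 24 non-overlapping frames so the output has 24 time steps.
frames = seg[: len(seg) // 24 * 24].reshape(24, -1)

# Power spectrum per frame, then a 64-band mel filterbank maps the FFT bins to 64 bins.
spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2            # (24, n_bins)
fb = librosa.filters.mel(sr=sr, n_fft=frames.shape[1], n_mels=64)
mel = spec @ fb.T                                          # (24, 64)

# Log compression yields a value distribution better suited to the CNN.
x = np.log(mel + 1e-6)[..., np.newaxis]                    # T x F x C = 24 x 64 x 1
print(x.shape)                                             # (24, 64, 1)
```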
“…A back-propagation algorithm was also used to predict suitable values by comparing the original images with the properties stored in the database. Audio processing in this module is based on time-frequency representation, frequency weighting, and scaling techniques [37,38].…”
Section: I) Input (mentioning, confidence: 99%)