2018 26th European Signal Processing Conference (EUSIPCO)
DOI: 10.23919/eusipco.2018.8553106

A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging

Abstract: In this paper, we empirically investigate the effect of audio preprocessing on music tagging with deep neural networks. We perform comprehensive experiments involving audio preprocessing using different time-frequency representations, logarithmic magnitude compression, frequency weighting, and scaling. We show that many commonly used input preprocessing techniques are redundant, with the exception of magnitude compression.
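As a rough illustration of the preprocessing variants the abstract lists, the sketch below computes a time-frequency representation and applies log compression, frequency weighting, and scaling. It is a minimal sketch only: the file name, STFT parameters, and the specific weighting/scaling formulas are assumptions, not the paper's exact experimental setup.

```python
import numpy as np
import librosa

# Placeholder input; the file name and STFT parameters are assumptions.
y, sr = librosa.load("example.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))  # time-frequency representation

# Logarithmic magnitude compression -- the one step the paper finds is not redundant.
S_log = np.log(S + 1e-7)

# Schematic frequency weighting (here A-weighting) and simple max scaling.
w_db = librosa.A_weighting(librosa.fft_frequencies(sr=sr, n_fft=1024))
S_weighted = S * (10.0 ** (w_db / 20.0))[:, np.newaxis]  # dB gains -> linear
S_scaled = S / (S.max() + 1e-7)
```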

Cited by 50 publications (27 citation statements)
References 14 publications

“…For this purpose, we use the dB-scale mel-scale magnitude spectrum of an input audio fragment, extracted by applying 128-band mel-filter banks to the Short-Time Fourier Transform (STFT). Mel-spectrograms have generally been a popular input representation choice for CNNs applied to music-related tasks [16,17,20,26,41,64]; besides, it was also recently reported that their frequency-domain summarization, based on psycho-acoustics, is efficient and not easily learnable through data-driven approaches [65,66]. We choose a 1024-sample window size and a 256-sample hop size, translating to about 46 ms and 11.6 ms respectively at a sampling rate of 22 kHz.…”
Section: Audio Preprocessing (mentioning, confidence: 99%)
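The excerpt fully specifies this front end, so it can be reproduced nearly verbatim. A minimal sketch using librosa, with the input file name as the only placeholder:

```python
import numpy as np
import librosa

y, sr = librosa.load("track.wav", sr=22050)  # placeholder file; 22 kHz as in the excerpt

# 128-band mel spectrogram over an STFT with a 1024-sample window (~46 ms)
# and a 256-sample hop (~11.6 ms), as described above.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128)

# dB-scale magnitude, i.e. logarithmic compression of the mel spectrogram.
S_db = librosa.power_to_db(S, ref=np.max)
print(S_db.shape)  # (128, n_frames)
```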
“…The filterbank maps the FFT bins to 64 bins and thus reduces the data to be processed and stored in a later stage of the signal processing pipeline. The subsequent log compression creates a distribution of values that is more suitable for the convolutional neural network [11]. With an input segment size of 12.8 seconds, the size of the time-frequency representation is Time x Frequency x Channels (T x F x C) = 24 x 64 x 1.…”
Section: Pre-processing (mentioning, confidence: 99%)
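A possible rendering of this pipeline is sketched below. The excerpt fixes the 64-band filterbank, the log compression, the 12.8 s segment, and the 24x64x1 output, but not the sample rate or framing; the 16 kHz rate and the 24 non-overlapping frames are assumptions chosen so the stated shape falls out.

```python
import numpy as np
import librosa

sr = 16000                                                 # assumed; not stated in the excerpt
seg = np.random.randn(int(12.8 * sr)).astype(np.float32)   # placeholder 12.8 s segment

# Split the segment into 24 non-overlapping frames so the output has 24 time steps.
frames = seg[: len(seg) // 24 * 24].reshape(24, -1)

# Power spectrum per frame, then a 64-band mel filterbank maps the FFT bins to 64 bins.
spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2            # (24, n_bins)
fb = librosa.filters.mel(sr=sr, n_fft=frames.shape[1], n_mels=64)
mel = spec @ fb.T                                          # (24, 64)

# Log compression yields a value distribution better suited to the CNN.
x = np.log(mel + 1e-6)[..., np.newaxis]                    # T x F x C = 24 x 64 x 1
print(x.shape)                                             # (24, 64, 1)
```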
“…A back-propagation algorithm was also used to predict suitable values by comparing the original images with the properties stored in the database. Audio processing in this module is based on time-frequency representation, frequency weighting, and scaling techniques [37,38].…”
Section: I) Input (mentioning, confidence: 99%)