2019
DOI: 10.48550/arxiv.1912.12055
Preprint

nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks

Kin Wai Cheuk,
Hans Anderson,
Kat Agres
et al.

Abstract: Despite recent developments in neural network models that use raw audio as input, state-of-the-art results on tasks such as automatic music transcription (AMT) and automatic speech recognition (ASR) are often still achieved by using frequency-domain features such as spectrograms as input. Converting time-domain waveforms to frequency-domain spectrograms is typically considered to be a preprocessing step done before model training. This approach, however, has several drawbacks. First, it takes a lot of hard disk…
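The abstract contrasts precomputed spectrograms with on-the-fly conversion. As a minimal sketch of the on-the-fly idea, the snippet below builds an nnAudio spectrogram layer (implemented with 1D convolutions) and applies it to a batch of waveforms like any other PyTorch module; the sample rate, FFT size, and dummy data are illustrative choices, not values from the paper.

```python
# Sketch: spectrogram extraction as a layer in the model pipeline rather than
# a preprocessing step. Assumes nnAudio's Spectrogram module API.
import torch
from nnAudio import Spectrogram

device = "cuda" if torch.cuda.is_available() else "cpu"

# STFT layer backed by 1D convolution kernels; moves to the GPU like any module.
stft_layer = Spectrogram.STFT(n_fft=2048, hop_length=512, sr=22050).to(device)

waveforms = torch.randn(4, 22050 * 2, device=device)  # (batch, samples): 2 s each
spectrograms = stft_layer(waveforms)  # computed on the fly, nothing written to disk
print(spectrograms.shape)
```

Because the conversion happens inside the forward pass, no intermediate frequency-domain representation has to be stored, which addresses the disk-space drawback the abstract raises.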

Cited by 2 publications (2 citation statements) | References 29 publications
“…Because higher harmonics of musical sounds often have low energy as compared to the lower harmonics, we take the log of the CQT magnitude spectrogram, so that the overtone structure of sounds is emphasized. For computing the CQTs, we use an adaptation of the CQT implementation of the nnAudio library [56]. Before computing the CQTs, we resample the mono mixes at 16 kHz.…”
Section: Model Input: CQT Log-Magnitude Spectrograms
confidence: 99%
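The statement above describes a concrete pipeline: resample the mono mixes to 16 kHz, compute a CQT with nnAudio, and log-compress the magnitude. A minimal sketch follows; the CQT parameters (fmin, n_bins, hop_length) and the use of torchaudio for resampling are assumptions for illustration, not details taken from the cited work [56].

```python
# Sketch of the cited pipeline: resample to 16 kHz, CQT via nnAudio, then log
# of the magnitude so low-energy higher harmonics are emphasized.
# fmin/n_bins/hop_length are illustrative choices, not the cited work's values.
import torch
import torchaudio
from nnAudio import Spectrogram

resample = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
cqt_layer = Spectrogram.CQT(sr=16000, hop_length=256,
                            fmin=32.70,  # C1, a common CQT starting pitch
                            n_bins=84, bins_per_octave=12)

mono_mix = torch.randn(1, 44100 * 5)   # (batch, samples): 5 s of audio at 44.1 kHz
audio_16k = resample(mono_mix)
cqt_mag = cqt_layer(audio_16k)         # CQT magnitude spectrogram
log_cqt = torch.log(cqt_mag + 1e-6)    # epsilon keeps the log finite for silent bins
```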
“…During training, a segment of 327,680 samples (roughly 20 seconds) is extracted from each song with a random starting point in each epoch to train the model. We convert the audio segments to spectrograms on-the-fly using nnAudio [35], a tool for GPU-based spectrogram extraction in PyTorch. We experiment with both Constant-Q transform (CQT) and Mel spectrograms as the input representations.…”
Section: B. Training Procedures and Hyperparameters
confidence: 99%
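This second statement pairs per-epoch random segment extraction (327,680 samples, roughly 20 seconds) with on-the-fly GPU spectrograms. Below is a minimal sketch of that pattern using nnAudio's MelSpectrogram layer; the sample rate, Mel parameters, and the random_segment helper are illustrative assumptions, and a CQT layer could be swapped in the same way.

```python
# Sketch: per-epoch random segment extraction plus on-the-fly GPU conversion.
# Sample rate and Mel parameters are illustrative, not from the cited work [35].
import torch
from nnAudio import Spectrogram

SEGMENT_LEN = 327_680  # roughly 20 s of audio at 16 kHz
device = "cuda" if torch.cuda.is_available() else "cpu"

mel_layer = Spectrogram.MelSpectrogram(sr=16000, n_fft=2048,
                                       n_mels=128, hop_length=512).to(device)

def random_segment(song: torch.Tensor) -> torch.Tensor:
    """Return a SEGMENT_LEN slice starting at a random point in the song."""
    start = torch.randint(0, song.shape[-1] - SEGMENT_LEN + 1, (1,)).item()
    return song[..., start:start + SEGMENT_LEN]

song = torch.randn(1, 16000 * 60)            # one minute of audio (dummy data)
segment = random_segment(song).to(device)
mel_spec = mel_layer(segment)                # converted on the fly each epoch
```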