2019
DOI: 10.48550/arxiv.1912.12055
Preprint

nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks

Kin Wai Cheuk,
Hans Anderson,
Kat Agres
et al.

Abstract: Despite recent developments in neural network models that use raw audio as input, state-of-the-art results on tasks such as automatic music transcription (AMT) and automatic speech recognition (ASR) are often still achieved by using frequency-domain features such as spectrograms as input. Converting time-domain waveforms to frequency-domain spectrograms is typically considered to be a preprocessing step done before model training. This approach, however, has several drawbacks. First, it takes a lot of hard disk…
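The abstract contrasts precomputed spectrograms with on-the-fly conversion. As a minimal sketch of the on-the-fly idea, the snippet below builds an nnAudio spectrogram layer (implemented with 1D convolutions) and applies it to a batch of waveforms like any other PyTorch module; the sample rate, FFT size, and dummy data are illustrative choices, not values from the paper.

```python
# Sketch: spectrogram extraction as a layer in the model pipeline rather than
# a preprocessing step. Assumes nnAudio's Spectrogram module API.
import torch
from nnAudio import Spectrogram

device = "cuda" if torch.cuda.is_available() else "cpu"

# STFT layer backed by 1D convolution kernels; moves to the GPU like any module.
stft_layer = Spectrogram.STFT(n_fft=2048, hop_length=512, sr=22050).to(device)

waveforms = torch.randn(4, 22050 * 2, device=device)  # (batch, samples): 2 s each
spectrograms = stft_layer(waveforms)  # computed on the fly, nothing written to disk
print(spectrograms.shape)
```

Because the conversion happens inside the forward pass, no intermediate frequency-domain representation has to be stored, which addresses the disk-space drawback the abstract raises.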

Cited by 2 publications (2 citation statements) | References 29 publications
“…Because higher harmonics of musical sounds often have low energy as compared to the lower harmonics, we take the log of the CQT magnitude spectrogram, so that the overtone structure of sounds is emphasized. For computing the CQTs, we use an adaptation of the CQT implementation of the nnAudio library [56]. Before computing the CQTs, we resample the mono mixes at 16 kHz.…”
Section: Model Input: CQT Log-Magnitude Spectrograms
confidence: 99%
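The statement above describes a concrete pipeline: resample the mono mixes to 16 kHz, compute a CQT with nnAudio, and log-compress the magnitude. A minimal sketch follows; the CQT parameters (fmin, n_bins, hop_length) and the use of torchaudio for resampling are assumptions for illustration, not details taken from the cited work [56].

```python
# Sketch of the cited pipeline: resample to 16 kHz, CQT via nnAudio, then log
# of the magnitude so low-energy higher harmonics are emphasized.
# fmin/n_bins/hop_length are illustrative choices, not the cited work's values.
import torch
import torchaudio
from nnAudio import Spectrogram

resample = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
cqt_layer = Spectrogram.CQT(sr=16000, hop_length=256,
                            fmin=32.70,  # C1, a common CQT starting pitch
                            n_bins=84, bins_per_octave=12)

mono_mix = torch.randn(1, 44100 * 5)   # (batch, samples): 5 s of audio at 44.1 kHz
audio_16k = resample(mono_mix)
cqt_mag = cqt_layer(audio_16k)         # CQT magnitude spectrogram
log_cqt = torch.log(cqt_mag + 1e-6)    # epsilon keeps the log finite for silent bins
```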
“…During training, a segment of 327,680 samples (roughly 20 seconds) is extracted from each song with a random starting point in each epoch to train the model. We convert the audio segments to spectrograms on-the-fly using nnAudio [35], a tool for GPU-based spectrogram extraction in PyTorch. We experiment with both Constant-Q transform (CQT) and Mel spectrograms as the input representations.…”
Section: B. Training Procedures and Hyperparameters
confidence: 99%
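This second statement pairs per-epoch random segment extraction (327,680 samples, roughly 20 seconds) with on-the-fly GPU spectrograms. Below is a minimal sketch of that pattern using nnAudio's MelSpectrogram layer; the sample rate, Mel parameters, and the random_segment helper are illustrative assumptions, and a CQT layer could be swapped in the same way.

```python
# Sketch: per-epoch random segment extraction plus on-the-fly GPU conversion.
# Sample rate and Mel parameters are illustrative, not from the cited work [35].
import torch
from nnAudio import Spectrogram

SEGMENT_LEN = 327_680  # roughly 20 s of audio at 16 kHz
device = "cuda" if torch.cuda.is_available() else "cpu"

mel_layer = Spectrogram.MelSpectrogram(sr=16000, n_fft=2048,
                                       n_mels=128, hop_length=512).to(device)

def random_segment(song: torch.Tensor) -> torch.Tensor:
    """Return a SEGMENT_LEN slice starting at a random point in the song."""
    start = torch.randint(0, song.shape[-1] - SEGMENT_LEN + 1, (1,)).item()
    return song[..., start:start + SEGMENT_LEN]

song = torch.randn(1, 16000 * 60)            # one minute of audio (dummy data)
segment = random_segment(song).to(device)
mel_spec = mel_layer(segment)                # converted on the fly each epoch
```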