A Bi-directional Transformer for Musical Chord Recognition

Park, Jonggwon; Choi, Kyoyun; Jeon, Sung-Wook; Kim, Do Kyun; Park, Jonghun

doi:10.48550/arxiv.1907.02698

Cited by 3 publications

(4 citation statements)

References 12 publications

“…We ran our experiments using the previously stated state-of-the-art classifier [15] and implemented the necessary modifications in order to run both the focal loss and self-learning examples. For our labeled dataset, we used the Isophonics Queen and Beatles dataset [2], and as our (large) unlabeled data, we used the audios indicated by the DALI dataset [18] which results in around 5,000 songs without chord label annotations.…”

Section: Results (mentioning)

confidence: 99%

“…Major and minor chords still dominate, but that is inevitable since they tend to accompany the rare chords. It is also important to notice that the generated subset has a smaller variety of chords, as the classifier we used, a state-of-the-art ACR technique based on the Transformer architecture [15], has a limited number of classes it is able to predict. If selectedDuration ≥ desiredDuration then move to next chord type Another important component of [13] is the addition of noise to the selected subset.…”

Section: Self-learning (mentioning)

confidence: 99%

Improving the Classification of Rare Chords With Unlabeled Data

Bortolozzo, Schramm, Jung, 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this work, we explore techniques to improve performance for rare classes in the task of Automatic Chord Recognition (ACR). We first explored the use of the focal loss in the context of ACR, which was originally proposed to improve the classification of hard samples. In parallel, we adapted a self-learning technique originally designed for image recognition to the musical domain. Our experiments show that both approaches individually (and their combination) improve the recognition of rare chords, but using only self-learning with noise addition yields the best results.
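The focal loss mentioned in this abstract can be illustrated with a minimal NumPy sketch. It assumes the standard formulation, FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the probability predicted for the true class. The function name `focal_loss`, the default `gamma=2.0`, and the omission of any class-weighting term are illustrative assumptions, not details taken from the cited paper.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, eps=1e-12):
    """Mean focal loss over a batch (hypothetical helper, not the authors' code).

    probs:   (n, n_classes) array of predicted class probabilities.
    targets: (n,) array of integer class indices.
    gamma:   focusing parameter; gamma = 0 reduces to plain cross-entropy.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=int)
    # Probability assigned to the true class of each sample.
    p_t = probs[np.arange(len(targets)), targets]
    p_t = np.clip(p_t, eps, 1.0)  # avoid log(0)
    # Confidently correct samples (p_t near 1) are down-weighted by (1 - p_t)^gamma.
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With `gamma=0` the function returns the ordinary cross-entropy; larger values shrink the contribution of samples the classifier already gets right, which is why the loss is of interest for rare classes.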

“…Transformer networks [37] have been shown to work well for a wide range of MIR tasks [38][39][40][41][42][43][44]. In this paper, we adopt the music tagging transformer proposed in [44] as our musical instrument recognition module, f_IR.…”

Section: Instrument Recognition Module f_IR (mentioning)

confidence: 99%

Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

Cheuk, Choi, Kong et al., 2022

Preprint

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utilizes instrument information and transcription results. The instrument conditioning is designed for an explicit multi-instrument functionality while the connection between the transcription and source separation modules is for better transcription performance. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how to evaluate such a model. During the experiment, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be utilized as a preprocessing module for other music analysis tasks. In the experiment on several downstream tasks, the symbolic representation provided by our transcription model turned out to be helpful to spectrograms in solving downbeat detection, chord recognition, and key estimation.

“…WaveNet is a model designed to take raw waveforms as input, and has inspired several recent audio related machine learning models [4][5][6]. Despite these advances, countless models are still using frequency domain features as the model's input for various tasks due to their superior performance [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]. Therefore, there is still value in developing a faster time-frequency conversion computation method, which is what we propose in this paper.…”

Section: Introduction (mentioning)

confidence: 99%

nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolution Neural Networks

Cheuk, Anderson, Agres et al., 2019

Preprint

Despite recent developments in neural network models that use raw audio as input, state-of-the-art results from tasks such as automatic music transcription (AMT) and automatic speech recognition (ASR) are often still achieved by using frequency domain features such as spectrograms as input. Converting time domain waveforms to frequency domain spectrograms is typically considered to be a preprocessing step done before model training. This approach, however, has several drawbacks. First, it takes a lot of hard disk space to store different frequency domain representations. This is especially true during the model development and tuning process, when exploring various types of spectrograms for optimal performance. Second, if another dataset is used, one must process all the audio clips again before the network can be retrained. In this paper, we integrate the time domain to frequency domain conversion as part of the model structure, and propose a neural network based toolbox, nnAudio, which leverages 1D convolutional neural networks to perform time domain to frequency domain conversion during feed-forward. It allows on-the-fly spectrogram generation without the need to store any spectrograms on the disk. This approach also allows back-propagation on the waveforms-to-spectrograms transformation layer, which implies that this transformation process can be made trainable, and hence further optimized by gradient descent. nnAudio reduces the waveforms-to-spectrograms conversion time for 1,770 waveforms (from the MAPS dataset) from 10.64 seconds with librosa to only 0.001 seconds for Short-Time Fourier Transform (STFT), 18.3 seconds to 0.015 seconds for Mel spectrogram, 103.4 seconds to 0.258 for constant-Q transform (CQT), when using GPU on our DGX work station with CPU: Intel(R) Xeon(R) CPU E5-2698
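The idea of computing a spectrogram with 1D convolution, as described in this abstract, can be sketched in plain NumPy. Each frequency bin corresponds to a pair of fixed windowed cosine and sine kernels that slide over the waveform with a stride equal to the hop size. This is an illustrative sketch only, not nnAudio's implementation; the function name `conv_stft_magnitude`, the Hann window, and the default sizes are assumptions.

```python
import numpy as np

def conv_stft_magnitude(signal, n_fft=64, hop=16):
    """STFT magnitude via strided 1D correlation with Fourier kernels.

    Hypothetical helper: returns an array of shape (n_frames, n_fft // 2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]  # one row per frequency bin
    window = np.hanning(n_fft)
    # Fixed kernels: windowed real and imaginary parts of the DFT basis.
    kernel_real = np.cos(2 * np.pi * k * n / n_fft) * window
    kernel_imag = -np.sin(2 * np.pi * k * n / n_fft) * window
    # Sliding the kernels with stride `hop` equals one dot product per frame.
    starts = np.arange(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] for s in starts])
    real = frames @ kernel_real.T
    imag = frames @ kernel_imag.T
    return np.sqrt(real ** 2 + imag ** 2)
```

In a deep learning framework the same kernels would be the weights of a 1D convolution layer, which is what makes the transform differentiable and, if the weights are left unfrozen, trainable.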
