Global Structure-Aware Drum Transcription Based on Self-Attention Mechanisms

Ishizuka, Ryoto; Nishikimi, Ryo; Yoshii, Kazuyoshi

doi:10.3390/signals2030031

Cited by 4 publications (13 citation statements)

References 39 publications (42 reference statements)

“…Due to their success, CNNs and RNNs are used together in an architecture named CRNN to model both acoustic and sequential features [2,3,5,9,10,12]. Recently, however, RNNs have been more commonly replaced by self-attention mechanisms [7], since this technique offers parallel computation and better performance when sufficient data are provided [4,6,16]. Finally, the learning of long-term sequential features is also performed with the help of an extra model, external to the transcription model, known as a language model [5,6].…”

Section: Architecture (mentioning)

confidence: 99%

“…Recently, however, RNNs have been more commonly replaced by self-attention mechanisms [7], since this technique offers parallel computation and better performance when sufficient data are provided [4,6,16]. Finally, the learning of long-term sequential features is also performed with the help of an extra model, external to the transcription model, known as a language model [5,6]. This model is meant to leverage symbolic data only (which are much more abundant than data from annotated audio) and is trained exclusively on them.…”

Section: Architecture (mentioning)

confidence: 99%

“…Different pre-processing techniques have been employed to make better use of the available datasets. For example, a strategy employed by Ishizuka et al consisted in using a source separation (SS) algorithm, in this case Spleeter [19], to remove non-drum instruments from the signal [5,6]. Unfortunately, this method deteriorated the quality of the transcriber, as SS tends to generate artifacts.…”

Section: Training Data (mentioning)

confidence: 99%

“…Unfortunately, this method deteriorated the quality of the transcriber, as SS tends to generate artifacts. Another method consists in synchronizing the predictions of the models to the tatum (the smallest durational subdivision of the main beat), thus effectively avoiding bias during the learning process and separating the note sequence from the tempo (BPM) [4][5][6]. This technique improved the transcription when compared to the original frame synchronicity.…”

Section: Training Data (mentioning)

confidence: 99%
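The tatum synchronization described in the statement above can be illustrated with a minimal sketch. It assumes that beat times are already known, that each beat is split into equal subdivisions, and that frame-level activations are max-pooled within each tatum interval. The function names `tatum_grid` and `frames_to_tatums`, the default of four subdivisions, and the choice of max-pooling are all hypothetical; the quoted papers may implement tatum synchronicity differently.

```python
from bisect import bisect_right


def tatum_grid(beat_times, subdivisions=4):
    """Split each beat interval into equal parts (assumed tatum definition)."""
    grid = []
    for start, end in zip(beat_times, beat_times[1:]):
        step = (end - start) / subdivisions
        grid.extend(start + k * step for k in range(subdivisions))
    grid.append(beat_times[-1])  # close the last interval
    return grid


def frames_to_tatums(frame_activations, frame_times, tatum_times):
    """Max-pool frame-level activations into one value per tatum interval.

    A frame at time t belongs to interval i when
    tatum_times[i] <= t < tatum_times[i + 1]; frames outside the grid
    are ignored. Intervals without any frame keep the value 0.0.
    """
    n_tatums = len(tatum_times) - 1
    pooled = [0.0] * n_tatums
    for activation, t in zip(frame_activations, frame_times):
        i = bisect_right(tatum_times, t) - 1
        if 0 <= i < n_tatums:
            pooled[i] = max(pooled[i], activation)
    return pooled


# Hypothetical example: two beats at 60 BPM, four tatums per beat.
grid = tatum_grid([0.0, 1.0, 2.0], subdivisions=4)
pooled = frames_to_tatums(
    frame_activations=[0.2, 0.9, 0.5, 0.7],
    frame_times=[0.1, 0.2, 0.3, 1.6],
    tatum_times=grid,
)
```

Because the output is indexed by tatum rather than by time, the same rhythmic pattern played at a different tempo maps to the same sequence, which is the tempo independence the statement refers to.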

“…In fact, only one architecture had been implemented with each dataset, without comparison to others. Here, instead, we present and assess a total of four different architectures that exploit two recent techniques: tatum synchronicity and self-attention mechanisms [5][6][7]. In terms of accuracy (the F-measure), we found that all these architectures are practically equivalent; hence, we concluded that, to a large extent, the favorable performance of our algorithm is not due to these recent improvements in DL.…”

Section: Introduction (mentioning)

confidence: 98%


High-Quality and Reproducible Automatic Drum Transcription from Crowdsourced Data

Zehren, Alunno, Bientinesi

2023, Signals


Within the broad problem known as automatic music transcription, we considered the specific task of automatic drum transcription (ADT). This is a complex task that has recently shown significant advances thanks to deep learning (DL) techniques. Most notably, massive amounts of labeled data obtained from crowds of annotators have made it possible to implement large-scale supervised learning architectures for ADT. In this study, we explored the untapped potential of these new datasets by addressing three key points: First, we reviewed recent trends in DL architectures and focused on two techniques, self-attention mechanisms and tatum-synchronous convolutions. Then, to mitigate the noise and bias that are inherent in crowdsourced data, we extended the training data with additional annotations. Finally, to quantify the potential of the data, we compared many training scenarios by combining up to six different datasets, including zero-shot evaluations. Our findings revealed that crowdsourced datasets outperform previously utilized datasets, and regardless of the DL architecture employed, they are sufficient in size and quality to train accurate models. By fully exploiting this data source, our models produced high-quality drum transcriptions, achieving state-of-the-art results. Thanks to this accuracy, our work can be more successfully used by musicians (e.g., to learn new musical pieces by reading, or to convert their performances to MIDI) and researchers in music information retrieval (e.g., to retrieve information from the notes instead of audio, such as the rhythm or structure of a piece).
