A Two-Stage Approach to Note-Level Transcription of a Specific Piano

Wang, Qi; Zhou, Ruohua; Yan, Yonghong

doi:10.3390/app7090901

Cited by 15 publications

(10 citation statements)

References 27 publications

“…Various models such as support vector machines (SVM) [5,6], restricted Boltzmann machines (RBM) [4], long short-term memory neural networks [7], and convolutional neural networks (CNN) [8,9] have been developed to tackle this task. For example, Wang et al. [10] integrate non-negative matrix factorization (NMF) with a CNN in order to improve transcription accuracy. Hawthorne et al. [11] split the AMT into three sub-tasks: onset detection, frame activation, and velocity estimation, which allows them to achieve state-of-the-art transcription accuracy on piano music.…”

Section: Introduction (mentioning)

confidence: 99%

The Impact of Audio Input Representations on Neural Network based Music Transcription

Cheuk, Agres, Herremans. 2020 International Joint Conference on Neural Networks (IJCNN), 2020.

This paper thoroughly analyses the effect of different input representations on polyphonic multi-instrument music transcription. We use our own GPU-based spectrogram extraction tool, nnAudio, to investigate the influence of using a linear-frequency spectrogram, log-frequency spectrogram, Mel spectrogram, and constant-Q transform (CQT). Our results show that an 8.33% increase in transcription accuracy and a 9.39% reduction in error can be obtained by choosing the appropriate input representation (log-frequency spectrogram with STFT window length 4,096 and 2,048 frequency bins in the spectrogram) without changing the neural network design (a single fully connected layer). Our experiments also show that the Mel spectrogram is a compact representation for which we can reduce the number of frequency bins to only 512 while still keeping a relatively high music transcription accuracy.

“…This onset-aware model significantly reduced note-level false positive errors, which is critical in perceptual evaluation of the transcription. Similar multi-state note modeling approaches are found in [4,7–10] and some detect even more phases of note envelope including onset, sustain and offset [11,12]. As such, various versions of note state representations have been suggested so far and showed improved performances.…”

Section: Introduction (mentioning)

confidence: 69%

“…Most recent approaches to polyphonic piano transcription are based on deep learning. The model architectures are diverse, including CNN [2,3,10,12], RNN [9,15], CRNN [4,5,16], and U-Net [17]. The loss function is typically the cross-entropy between predicted and ground truth labels but also includes the adversarial loss [5].…”

Section: Multi-state Note Modeling (mentioning)

confidence: 99%

“…The loss function is typically the cross-entropy between predicted and ground truth labels but also includes the adversarial loss [5]. An important direction in designing a neural network architecture is detecting note onset explicitly apart from the binary on/off states [4,9,10,12], considering that piano sound starts with a percussive tone but, after the attack part, it slowly decays with a harmonic tone [18]. This multi-state note modeling, even including note offset, was already explored before the DNN approaches became dominant [7,8,11].…”

Section: Multi-state Note Modeling (mentioning)

confidence: 99%

Polyphonic Piano Transcription Using Autoregressive Multi-State Note Model

Kwon, Jeong, Nam. Preprint, 2020.

Recent advances in polyphonic piano transcription have been made primarily by a deliberate design of neural network architectures that detect different note states such as onset or sustain and model the temporal evolution of the states. The majority of them, however, use separate neural networks for each note state, thereby optimizing multiple loss functions, and they handle the temporal evolution of note states by abstract connections between the state-wise neural networks or by using a post-processing module. In this paper, we propose a unified neural network architecture where multiple note states are predicted as a softmax output with a single loss function and the temporal order is learned by an auto-regressive connection within the single neural network. This compact model allows the number of note states to be increased without added architectural complexity. Using the MAESTRO dataset, we examine various combinations of multiple note states including on, onset, sustain, re-onset, offset, and off. We also show that the autoregressive module effectively learns the inter-state dependency of notes. Finally, we show that our proposed model achieves performance comparable to the state of the art with fewer parameters.

“…A comprehensive experimental comparison of DNN- and NMF-based ADT methods has been reported in [3]. Convolutional neural networks (CNNs), for example, have been used for extracting local time-frequency features from an input spectrogram [11–15]. Recurrent neural networks (RNNs) are expected to learn the temporal dynamics inherent in music and have successfully been used, often in combination with CNNs, for estimating the smooth onset probabilities of drum sounds at the frame level [16–18].…”

Section: Introduction (mentioning)

confidence: 99%

Global Structure-Aware Drum Transcription Based on Self-Attention Mechanisms

Ishizuka, Nishikimi, Yoshii. Signals, 2021.

This paper describes an automatic drum transcription (ADT) method that directly estimates a tatum-level drum score from a music signal in contrast to most conventional ADT methods that estimate the frame-level onset probabilities of drums. To estimate a tatum-level score, we propose a deep transcription model that consists of a frame-level encoder for extracting the latent features from a music signal and a tatum-level decoder for estimating a drum score from the latent features pooled at the tatum level. To capture the global repetitive structure of drum scores, which is difficult to learn with a recurrent neural network (RNN), we introduce a self-attention mechanism with tatum-synchronous positional encoding into the decoder. To mitigate the difficulty of training the self-attention-based model from an insufficient amount of paired data and to improve the musical naturalness of the estimated scores, we propose a regularized training method that uses a global structure-aware masked language (score) model with a self-attention mechanism pretrained from an extensive collection of drum scores. The experimental results showed that the proposed regularized model outperformed the conventional RNN-based model in terms of the tatum-level error rate and the frame-level F-measure, even when only a limited amount of paired data was available so that the non-regularized model underperformed the RNN-based model.
