2019 27th European Signal Processing Conference (EUSIPCO)
DOI: 10.23919/eusipco.2019.8902550
Joint Singing Voice Separation and F0 Estimation with Deep U-Net Architectures

Citation: Jansson, A., Bittner, R. M., Ewert, S. and Weyde, T. (ORCID: 0000-0001-8028-9905) (2019). Joint singing voice separation and F0 estimation with deep U-net architectures.

Abstract: Vocal source separation and fundamental frequency estimation in music are tightly related tasks. The outputs of vocal source separation systems have previously been used as inputs to vocal fundamental frequency estimation systems; conversely, vocal fundamental frequency has been used as side information to improve vocal source…

Cited by 48 publications (64 citation statements)
References 13 publications
“…We use this prediction along with the vocoder features to synthesise the audio signal. We tried both the discrete representation of the fundamental frequency as described in [23] and a continuous representation, normalised to the range 0 to 1 as used in [13] and found that while the discrete representation leads to slightly higher accuracy in the output, the continuous representation produces a pitch contour perceptually more suitable for synthesis of the signal. Fig.…”
Section: Methods
confidence: 99%
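The two F0 encodings contrasted in the excerpt above can be made concrete. Below is a minimal sketch, not code from any of the cited papers: it maps an F0 contour in Hz to a discrete one-hot-per-frame representation over quantized log-frequency bins, and to a single continuous value per frame normalised to the range 0 to 1. The 360-bin axis, 20-cent resolution, and C1 lower bound are assumptions, chosen in the style of Deep Salience-like salience maps.

```python
import numpy as np

# Assumed axis (not from the papers): 6 octaves above C1 (~32.7 Hz),
# 60 bins per octave, i.e. 20-cent resolution.
F_MIN = 32.7
BINS_PER_OCTAVE = 60
N_BINS = 360

def f0_to_discrete(f0_hz):
    """Quantize an F0 contour (Hz; 0 = unvoiced) to one-hot bin activations."""
    out = np.zeros((len(f0_hz), N_BINS), dtype=np.float32)
    voiced = f0_hz > 0
    bins = np.round(
        BINS_PER_OCTAVE * np.log2(f0_hz[voiced] / F_MIN)).astype(int)
    out[np.flatnonzero(voiced), np.clip(bins, 0, N_BINS - 1)] = 1.0
    return out

def f0_to_continuous(f0_hz):
    """Map an F0 contour to one value per frame on a log-frequency scale,
    normalised to [0, 1]; unvoiced frames map to 0."""
    out = np.zeros(len(f0_hz), dtype=np.float32)
    voiced = f0_hz > 0
    pos = BINS_PER_OCTAVE * np.log2(f0_hz[voiced] / F_MIN) / (N_BINS - 1)
    out[voiced] = np.clip(pos, 0.0, 1.0)
    return out

f0 = np.array([0.0, 220.0, 223.0, 0.0, 440.0])  # contour with unvoiced gaps
print(f0_to_discrete(f0).argmax(axis=1))  # one active bin per voiced frame
print(f0_to_continuous(f0))               # smooth trajectory in [0, 1]
```

The trade-off quoted above follows from these encodings: the one-hot target quantizes the contour to discrete steps, while the continuous target preserves fine pitch movement, which matters when the contour drives a vocoder.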
“…Most recently, Nakano et al [22] and Jansson et al [23] almost at the same time proposed to train the SVS task and the VME task jointly. Both methods obtained promising results.…”
Section: Source Separation-based Vocal Melody Extraction
confidence: 99%
“…Both methods obtained promising results. In [22], a joint U-Net model stacking SVS and VME was proposed. However, limited by the size of datasets containing both pure vocal tracks and their corresponding F0 annotations, the authors used a large internal dataset where reference F0 values were annotated by the VME method Deep Salience [5].…”
Section: Source Separation-based Vocal Melody Extraction
confidence: 99%
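As a rough illustration of the stacked design described above, the sketch below chains a separation network into an F0-salience network so that the second consumes the first's masked output and both can be trained jointly. The tiny two-level encoder-decoder, layer sizes, and sigmoid outputs are placeholder assumptions, not the architecture of [22] or [23].

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Placeholder encoder-decoder with one skip connection; a small
    stand-in for a real U-Net, for illustration only."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(
            nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Conv2d(16 + in_ch, out_ch, 1)

    def forward(self, x):
        h = self.up(self.down(x))                  # assumes even H and W
        return self.out(torch.cat([h, x], dim=1))  # skip from the input

class JointSeparationF0(nn.Module):
    """Stack: mixture -> vocal mask -> masked vocals -> F0 salience."""
    def __init__(self):
        super().__init__()
        self.separator = TinyUNet(1, 1)  # predicts a soft vocal mask
        self.f0_net = TinyUNet(1, 1)     # predicts salience on the same grid

    def forward(self, mix_mag):
        mask = torch.sigmoid(self.separator(mix_mag))
        vocals = mask * mix_mag                    # masked vocal spectrogram
        salience = torch.sigmoid(self.f0_net(vocals))
        return vocals, salience

model = JointSeparationF0()
mix = torch.rand(1, 1, 256, 128)           # toy magnitude STFT of a mixture
vocals, salience = model(mix)
# Joint training would sum a separation loss on `vocals` and an F0 loss on
# `salience`, e.g. against Deep Salience-derived annotations as quoted above.
print(vocals.shape, salience.shape)        # both (1, 1, 256, 128)
```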
“…Therefore, it is essential to accurately estimate the Wiener gain in each time-frequency slot using a method such as time-frequency masking. Approaches that directly estimate the Wiener gain have also been developed, which use deep learning techniques to model a mapping function from the mixed sound signals into time-frequency masks, and deep networks pretrained with target sound-source signals. On the other hand, deep clustering has been proposed as an approach for estimating not the time-frequency mask but time-frequency embedding vectors, so that the embedding vectors for time-frequency slot pairs dominated by the same sound-source signal are close together, while those for other signals are further away.…”
Section: Recent Research Trends In Environmental Sound Processing
confidence: 99%
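For context on the Wiener gain itself: when the target and interference power spectra are known, the gain has the closed form |S|² / (|S|² + |N|²) per time-frequency slot. The sketch below computes this oracle gain and applies it as a mask; toy random arrays stand in for real spectrograms, and the learned approaches quoted above estimate the gain (or an embedding per slot) from the mixture alone.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (513, 100)                    # (frequency bins, frames), toy size

target_pow = rng.random(shape)        # |S(t,f)|^2, e.g. the target source
interf_pow = rng.random(shape)        # |N(t,f)|^2, everything else

# Oracle Wiener gain per time-frequency slot, bounded in [0, 1].
eps = 1e-10                           # guard against empty slots
wiener_gain = target_pow / (target_pow + interf_pow + eps)

# Applying the gain as a time-frequency mask on the mixture STFT.
mixture_stft = (rng.standard_normal(shape)
                + 1j * rng.standard_normal(shape))
estimated_target = wiener_gain * mixture_stft
print(wiener_gain.min(), wiener_gain.max())  # stays within [0, 1]
```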