2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383585
Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher

Cited by 8 publications (8 citation statements)
References 8 publications
“…Following [37], evaluation metrics, i.e., F0 Root Mean Square Error (F0-RMSE), F0 Pearson Correlation Coefficient (F0-PCC), and duration accuracy (duracc) are used to evaluate the synthesized results objectively. To match the length difference between the ground-truth singing voice and the generated voice, the calculation of F0-RMSE and F0-PCC is conducted on generated singing voices that were created based on the ground-truth phoneme duration.…”
Section: Results
confidence: 99%
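Both of the F0 metrics named in this statement operate on time-aligned contours. A minimal sketch of F0-RMSE and F0-PCC, assuming F0 contours in Hz with unvoiced frames marked by zeros and masking to frames voiced in both contours (the masking convention is an assumption, not stated in the excerpt):

```python
# Illustrative sketch: F0-RMSE and F0-PCC between a ground-truth and a
# synthesized F0 contour, assuming both were generated with ground-truth
# phoneme durations so the frame counts already match.
import numpy as np

def f0_metrics(f0_ref: np.ndarray, f0_syn: np.ndarray):
    """Return (F0-RMSE in Hz, F0-PCC) over frames voiced in both contours."""
    assert f0_ref.shape == f0_syn.shape, "contours must be time-aligned"
    voiced = (f0_ref > 0) & (f0_syn > 0)   # unvoiced frames carry F0 = 0
    ref, syn = f0_ref[voiced], f0_syn[voiced]
    rmse = np.sqrt(np.mean((ref - syn) ** 2))
    pcc = np.corrcoef(ref, syn)[0, 1]       # Pearson correlation coefficient
    return rmse, pcc

# Example with dummy contours (200 frames, values in Hz):
rng = np.random.default_rng(0)
f0_ref = np.abs(rng.normal(220.0, 30.0, 200))
f0_syn = f0_ref + rng.normal(0.0, 5.0, 200)
print(f0_metrics(f0_ref, f0_syn))
```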
“…C. Evaluation 1) Objective Evaluation: Three kinds of objective criteria, namely F0 root mean square error (RMSE), F0 correlation coefficient, and duration accuracy, are used with reference to [17] to evaluate the different models. To compare the F0 of the singing synthesized by each model fairly, we supply the real duration to all models instead of the predicted duration.…”
Section: B. Experimental Configuration
confidence: 99%
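The duration accuracy (duracc) metric is named but not defined in these excerpts. A plausible sketch, assuming a phoneme counts as correct when its predicted duration lies within a fixed frame tolerance of the reference (both the tolerance and the definition itself are assumptions, not taken from the cited papers):

```python
# Hypothetical duracc sketch: fraction of phonemes whose predicted duration
# (in frames) deviates from the reference by at most tol_frames.
import numpy as np

def duration_accuracy(dur_ref, dur_pred, tol_frames: int = 5) -> float:
    """Fraction of phonemes with |reference - predicted| <= tol_frames."""
    dur_ref = np.asarray(dur_ref)
    dur_pred = np.asarray(dur_pred)
    return float(np.mean(np.abs(dur_ref - dur_pred) <= tol_frames))

print(duration_accuracy([20, 35, 12], [22, 30, 13]))  # -> 1.0 with tol 5
```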
“…On the other hand, we apply adversarial domain adaptation to the phoneme encoder to learn a pitch-independent phoneme representation. As domain adaptation has attracted much research on voice generation in recent years, various adversarially trained domain classifiers have been designed for different tasks, such as a speaker classifier and tone classifier for multi-speaker cross-lingual TTS [14], a language classifier for cross-lingual TTS [15], a noise classifier for voice cloning from noisy samples [16], a speaker-singer classifier for cloning speech to singing [17], and a singer classifier for multi-singer SVS [18]. It is worth mentioning that, to obtain more accurate pitch translation in the singing voice conversion task, in [19] the pitch information removed in the encoder is compensated for by feeding explicit pitch to the decoder.…”
Section: Introduction
confidence: 99%
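The adversarially trained domain classifiers mentioned in this statement are typically attached to the encoder through a gradient reversal layer. A minimal PyTorch sketch of that pattern for a pitch classifier on phoneme-encoder features; the module names, dimensions, and quantized-pitch target are illustrative assumptions, not taken from the cited papers:

```python
# Sketch of adversarial domain adaptation via a gradient reversal layer (GRL).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Flip the gradient sign before it reaches the encoder.
        return -ctx.lam * grad_out, None

class PitchClassifier(nn.Module):
    """Predicts a quantized pitch class from phoneme-encoder features."""
    def __init__(self, dim=256, n_pitch_classes=64, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_pitch_classes))

    def forward(self, enc_out):
        return self.net(GradReverse.apply(enc_out, self.lam))

# Cross-entropy trains the classifier normally, while the reversed gradient
# pushes the encoder toward pitch-independent representations.
enc_out = torch.randn(8, 256, requires_grad=True)   # dummy encoder frames
pitch_labels = torch.randint(0, 64, (8,))
loss = nn.CrossEntropyLoss()(PitchClassifier()(enc_out), pitch_labels)
loss.backward()
```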
“…Shi et al. [29] combined the perceptual entropy loss function with mainstream sequence models, including the RNN, Transformer, and Conformer, for singing voice synthesis. Xue et al. [30] used an acoustic model with an encoder-decoder architecture to perform end-to-end training on frame-level input. In the decoder, an RNN takes the current encoder output and the mel spectrum of the previous time step as input to predict the mel spectrum of the current time step.…”
Section: Introduction
confidence: 99%
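The decoder described in this last statement is autoregressive at the frame level. A minimal PyTorch sketch of such a loop, where a GRU cell consumes the current encoder output concatenated with the previous mel frame and predicts the current mel frame; all dimensions and names are illustrative assumptions, not the actual architecture of Xue et al. [30]:

```python
# Sketch of a frame-level autoregressive mel decoder.
import torch
import torch.nn as nn

class ARMelDecoder(nn.Module):
    def __init__(self, enc_dim=256, mel_dim=80, hidden=512):
        super().__init__()
        self.gru = nn.GRUCell(enc_dim + mel_dim, hidden)
        self.proj = nn.Linear(hidden, mel_dim)

    def forward(self, enc_seq):                      # enc_seq: (T, enc_dim)
        h = enc_seq.new_zeros(self.gru.hidden_size)
        prev_mel = enc_seq.new_zeros(self.proj.out_features)
        mels = []
        for t in range(enc_seq.size(0)):             # one step per frame
            x = torch.cat([enc_seq[t], prev_mel]).unsqueeze(0)
            h = self.gru(x, h.unsqueeze(0)).squeeze(0)
            prev_mel = self.proj(h)                   # predict current frame
            mels.append(prev_mel)
        return torch.stack(mels)                      # (T, mel_dim)

mel = ARMelDecoder()(torch.randn(100, 256))
print(mel.shape)  # torch.Size([100, 80])
```

During training one would usually teacher-force prev_mel with the ground-truth previous frame; the free-running loop above matches inference.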