Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher

Xue, Heyang; Yang, Shan; Lei, Yi; Xie, Lei; Li, Xiulin

doi:10.1109/slt48900.2021.9383585

Cited by 8 publications

(8 citation statements)

References 8 publications


“…Following [37], evaluation metrics, i.e., F0 Root Mean Square Error (F0-RMSE), F0 Pearson Correlation Coefficient (F0-PCC), and duration accuracy (duracc) are used to evaluate the synthesized results objectively. To match the length difference between the ground-truth singing voice and the generated voice, the calculation of F0-RMSE and F0-PCC is conducted on generated singing voices that were created based on the ground-truth phoneme duration.…”

Section: Results (mentioning)

confidence: 99%

Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis

Wang, Wang, Zhu et al. 2022

Preprint

Self Cite


This paper introduces Opencpop, a publicly available high-quality Mandarin singing corpus designed for singing voice synthesis (SVS). The corpus consists of 100 popular Mandarin songs performed by a female professional singer. Audio files are recorded with studio quality at a sampling rate of 44,100 Hz and the corresponding lyrics and musical scores are provided. All singing recordings have been phonetically annotated with phoneme boundaries and syllable (note) boundaries. To demonstrate the reliability of the released data and to provide a baseline for future research, we built baseline deep neural network-based SVS models and evaluated them with both objective metrics and subjective mean opinion score (MOS) measure. Experimental results show that the best SVS model trained on our database achieves 3.70 MOS, indicating the reliability of the provided corpus. Opencpop is released to the open-source community WeNet, and the corpus, as well as synthesized demos, can be found on the project homepage.
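The citation statement above names F0-RMSE and F0-PCC as objective metrics computed on contours that share the ground-truth phoneme durations. The following is a minimal sketch of how such metrics are commonly computed, not the citing paper's actual evaluation code. It assumes two frame-aligned F0 contours in Hz, that unvoiced frames are marked with 0, and that only frames voiced in both contours are compared; the function name `f0_metrics` is hypothetical.

```python
import math


def f0_metrics(f0_ref, f0_syn):
    """Return (RMSE, Pearson correlation) over frames voiced in both contours.

    Assumption: both contours have the same number of frames, which holds
    when synthesis is driven by the ground-truth phoneme durations.
    Assumption: a value of 0 marks an unvoiced frame.
    """
    if len(f0_ref) != len(f0_syn):
        raise ValueError("contours must be frame-aligned")

    # Keep only frames where both contours are voiced.
    pairs = [(r, s) for r, s in zip(f0_ref, f0_syn) if r > 0 and s > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two frames voiced in both contours")

    # Root mean square error of the F0 difference.
    rmse = math.sqrt(sum((r - s) ** 2 for r, s in pairs) / n)

    # Pearson correlation coefficient of the two contours.
    mean_r = sum(r for r, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_s = sum((s - mean_s) ** 2 for _, s in pairs)
    if var_r == 0 or var_s == 0:
        raise ValueError("correlation is undefined for a constant contour")
    pcc = cov / math.sqrt(var_r * var_s)

    return rmse, pcc
```

A lower RMSE indicates smaller absolute pitch error, while a correlation close to 1 indicates that the synthesized contour follows the shape of the reference even if it is offset.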


“…C. Evaluation 1) Objective Evaluation: Three kinds of objective criteria that include F0 root mean square error (RMSE), F0 correlation coefficients and duration accuracy with reference to [17] are conducted to evaluate different models. In order to fairly compare F0 of synthesized singing by each model, we set real duration to all models instead of predicted duration.…”

Section: B Experimental Configuration (mentioning)

confidence: 99%

“…On the other hand, we apply adversarial domain adaption for the phoneme encoder to learn a pitch-independent phoneme representation. As domain adaption has attracted many research for generating voice in recent years, various adversarially trained domain classifiers are also designed for different tasks, such as speaker classifier and tone classifier for multi-speaker cross-lingual TTS [14], language classifier for cross-lingual TTS [15], noise classifier for voice cloning from noise sample [16], speaker-singer classifier for cloning speech to singing [17], and singer classifier for multi-singer SVS [18]. It is worth mentioning that, to obtain more accurate pitch translation on singing voice conversion task, [19] and the removed pitch information in the encoder is compensated by feeding explicit pitch to decoder.…”

Section: Introduction (mentioning)

confidence: 99%

Pitch Preservation In Singing Voice Synthesis

Liu, Zhu, Wang et al. 2021

Preprint


Suffering from limited singing voice corpus, existing singing voice synthesis (SVS) methods that build encoder-decoder neural networks to directly generate spectrogram could lead to out-of-tune issues during inference phase. To attenuate these issues, this paper presents a novel acoustic model with independent pitch encoder and phoneme encoder, which disentangles the phoneme and pitch information from music score to fully utilize the corpus. Specifically, according to equal temperament theory, the pitch encoder is constrained by a pitch metric loss that maps distances between adjacent input pitches into corresponding frequency multiples between the encoder outputs. For the phoneme encoder, based on the analysis that same phonemes corresponding to varying pitches can produce similar pronunciations, this encoder is followed by an adversarially trained pitch classifier to enforce the identical phonemes with different pitches mapping into the same phoneme feature space. By these means, the sparse phonemes and pitches in original input spaces can be transformed into more compact feature spaces respectively, where same elements cluster closely and cooperate mutually to enhance synthesis quality. Then, the outputs of the two encoders are summed together to pass through the following decoder in the acoustic model. Experimental results indicate that the proposed approaches can characterize intrinsic structure between pitch inputs to obtain better pitch synthesis accuracy and achieve superior singing synthesis performance against the advanced baseline system.
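The abstract above refers to equal temperament, under which a fixed step in pitch corresponds to a fixed frequency multiple. The sketch below illustrates only that underlying relation and is not the paper's pitch metric loss. It assumes pitches are given as MIDI note numbers with A4 (note 69) tuned to 440 Hz; the function names are hypothetical.

```python
def midi_to_hz(note, a4_hz=440.0):
    """Convert a MIDI note number to frequency in Hz (assumes A4 = note 69)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


def frequency_ratio(note_a, note_b):
    """Frequency multiple between two notes in twelve-tone equal temperament.

    Each semitone multiplies the frequency by 2 ** (1 / 12), so the ratio
    depends only on the distance between the two notes.
    """
    return 2.0 ** ((note_b - note_a) / 12.0)
```

For instance, any two adjacent semitones differ by the same factor of 2 ** (1 / 12), and notes twelve semitones apart differ by exactly a factor of 2.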


“…Shi et al [29] combined the perceptual entropy loss function with mainstream time sequence models, including RNN, transformer, and conformer for singing voice synthesis. Xue et al [30] used an acoustic model of the encoder-decoder architecture to perform end-to-end training on frame-level input. In the decoder, the RNN uses the current encoder output and the Mel spectrum of the previous time sequence as input to predict the Mel spectrum of the current time sequence.…”

Section: Introduction (mentioning)

confidence: 99%

SUSing: SU-net for Singing Voice Synthesis

Zhang, Wang, Cheng et al. 2022

Preprint


Singing voice synthesis is a generative task that involves multi-dimensional control of the singing model, including lyrics, pitch, and duration, and includes the timbre of the singer and singing skills such as vibrato. In this paper, we proposed SU-net for singing voice synthesis named SUSing. Synthesizing singing voice is treated as a translation task between lyrics and music score and spectrum. The lyrics and music score information is encoded into a two-dimensional feature representation through the convolution layer. The two-dimensional feature and its frequency spectrum are mapped to the target spectrum in an autoregressive manner through a SU-net network. Within the SU-net the stripe pooling method is used to replace the alternate global pooling method to learn the vertical frequency relationship in the spectrum and the changes of frequency in the time domain. The experimental results on the public dataset Kiritan show that the proposed method can synthesize more natural singing voices.
