Reconstructing Dual Learning for Neural Voice Conversion Using Relatively Few Samples

Sun, Aolan; Wang, Jianzong; Cheng, Ning; Tantrawenith, Methawee; Wu, Zhiyong; Meng, Helen; Xiao, Edward; Xiao, Jing

doi:10.1109/asru51503.2021.9687965

Cited by 10 publications

(3 citation statements)

References 28 publications

“…The singing voice synthesis method has experienced the traditional concatenative method based on lyrics to singing alignment [15]-[17], and the synthesis method based on statistical parameters [18], [19], to the present end-to-end deep learning method [1], [20]-[23]. The concatenative based synthesis method has high sound quality but weak generalization ability and phonemes that are not covered in the training set cannot be generated [24].…”

Section: Introduction (mentioning)

confidence: 99%

SUSing: SU-net for Singing Voice Synthesis

Zhang, Wang, Cheng et al. 2022

Preprint

Self Cite

Singing voice synthesis is a generative task that involves multi-dimensional control of the singing model, including lyrics, pitch, and duration, as well as the timbre of the singer and singing skills such as vibrato. In this paper, we propose SUSing, an SU-net for singing voice synthesis. Synthesizing singing voice is treated as a translation task between lyrics and music score on one side and spectrum on the other. The lyrics and music score information is encoded into a two-dimensional feature representation through a convolution layer. The two-dimensional feature and its frequency spectrum are mapped to the target spectrum in an autoregressive manner through an SU-net. Within the SU-net, the stripe pooling method is used in place of the alternate global pooling method to learn the vertical frequency relationships in the spectrum and the changes of frequency in the time domain. Experimental results on the public Kiritan dataset show that the proposed method can synthesize more natural singing voices.

“…Text-to-speech synthesis (TTS) aims to generate intelligible and natural speech from the input text or phoneme sequence [1]-[5]. It has a long history in the TTS research, from the method of concatenative synthesis, and statistical parametric synthesis to the recent method of deep neural TTS.…”

Section: Introduction (mentioning)

confidence: 99%

TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS

Zhang, Wang, Cheng et al. 2022

Preprint

Self Cite

Recently, synthesizing personalized speech with text-to-speech (TTS) applications has been in high demand. However, previous TTS models require a large amount of target-speaker speech for training, and recording many utterances from the target speaker is costly and difficult. Data augmentation of the speech is one solution, but it leads to low-quality synthesized speech. Some multi-speaker TTS models have been proposed to address the issue, but the imbalance in the number of utterances per speaker leads to a voice similarity problem. We propose the Target Domain Adaptation Speech Synthesis Network (TDASS) to address these issues. Built on the backbone of Tacotron2, a high-quality TTS model, TDASS introduces a self-interested classifier to reduce the non-target influence. In addition, a special gradient reversal layer with different operations for target and non-target is added to the classifier. We evaluate the model on a Chinese speech corpus; the experiments show that the proposed method outperforms the baseline in terms of voice quality and voice similarity.

“…In addition to the feature representation, most research focus on the classifier. Different classifiers have been tried on singer identification, including SVM, GMM, HMM, and random forest [2], [15]-[17]. With the successful application of deep models in various tasks [5], [18], some studies are using deep models to improve performance on singer identification, such as CRNN [19] which is a state of the art method.…”

Section: Introduction (mentioning)

confidence: 99%

MetaSID: Singer Identification with Domain Adaptation for Metaverse

Zhang, Wang, Cheng et al. 2022

Preprint

Self Cite

The Metaverse has stretched the real world into unlimited space, and there will be more live concerts in the Metaverse. The task of singer identification is to identify which singer a song belongs to. However, a hard problem in singer identification is the difference in live effects: the studio version differs from the live version, so the data distributions of the training set and the test set differ and the performance of the classifier decreases. This paper proposes using domain adaptation to address the live effect in singer identification. Three domain adaptation methods combined with a Convolutional Recurrent Neural Network (CRNN) are designed: Maximum Mean Discrepancy (MMD), gradient reversal (RevGrad), and Contrastive Adaptation Network (CAN). MMD is a distance-based method that adds a domain loss. RevGrad is based on the idea that learned features can represent samples from different domains. CAN is based on class adaptation; it takes into account the correspondence between the categories of the source domain and the target domain. Experimental results on the public Artist20 dataset show that CRNN-MMD improves over the baseline CRNN by 0.14 and that CRNN-RevGrad outperforms the baseline by 0.21. CRNN-CAN achieves state of the art with an F1 measure of 0.83 on the album split.
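As a rough illustration of what a distance-based domain loss such as MMD can look like, the sketch below computes a biased estimate of the squared MMD between two batches of feature vectors. It is not taken from the MetaSID paper: the Gaussian kernel, the `gamma` bandwidth, and the function names `rbf_kernel`, `mean_kernel`, and `mmd_squared` are all assumptions made for this example.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between source and target batches.

    Hypothetical helper for illustration: a value near zero suggests the
    two batches have similar feature distributions under this kernel.
    """
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )


# Toy features standing in for studio (source) and live (target) embeddings.
studio = [[0.0, 1.0], [1.0, 0.0]]
live = [[5.0, 6.0], [6.0, 5.0]]
print(mmd_squared(studio, studio))  # identical batches give zero
print(mmd_squared(studio, live))    # well-separated batches give a larger value
```

In a training setup of this kind, such a term would typically be added to the classification loss with a weighting factor, so that the feature extractor is pushed to produce similar distributions for both domains; the weighting and the choice of layer are design decisions not specified here.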
