ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

Gu, Yu; Yin, Xiang; Rao, Yonghui; Yuan, Weihua; Tang, Benlai; Zhang, Yang; Chen, Jitong; Wang, Yuxuan; Ma, Zejun

doi:10.48550/arxiv.2004.11012

Cited by 13 publications

(16 citation statements)

References 10 publications


“…M k can be calculated in closed form time [7]. 5 Audio samples are available via https://diffsinger.github.io.…”

Section: Diffusion Model (mentioning)

confidence: 99%


DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Liu, Li, Ren et al. 2021

Preprint


A singing voice synthesis (SVS) system is built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score (lyrics, pitch, and duration). Previous singing acoustic models adopt a simple loss (e.g., L1 or L2) or a generative adversarial network (GAN) to reconstruct the acoustic features, but they suffer from over-smoothing and unstable training respectively, which hinders the naturalness of the synthesized singing. In this work, we propose DiffSinger, an acoustic model for SVS based on the diffusion probabilistic model. DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score. By implicitly optimizing a variational bound, DiffSinger can be trained stably and generates realistic outputs. To further improve voice quality and speed up inference, we introduce a shallow diffusion mechanism that makes better use of the prior knowledge learned by the simple loss. Specifically, DiffSinger starts generation at a shallow step smaller than the total number of diffusion steps, chosen according to the intersection of the diffusion trajectories of the ground-truth mel-spectrogram and the one predicted by a simple mel-spectrogram decoder. In addition, we train a boundary prediction network to locate the intersection and determine the shallow step adaptively. Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work. Our extended experiments also demonstrate that DiffSinger generalizes to the text-to-speech task.
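The shallow-step idea in the abstract above can be illustrated with a minimal sketch. It is not the DiffSinger implementation: it assumes the standard closed-form forward process of a diffusion probabilistic model, x_k = sqrt(ᾱ_k)·x_0 + sqrt(1 − ᾱ_k)·ε, a linear noise schedule with made-up endpoints, a hand-picked shallow step `k`, and a four-value list standing in for the simple decoder's mel-spectrogram. The names `linear_beta_schedule`, `cumulative_alphas`, and `diffuse_to_step` are hypothetical helpers introduced for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.06):
    """Linearly spaced noise schedule beta_1..beta_T (endpoints are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse_to_step(x0, k, alpha_bars, rng):
    """Sample x_k ~ q(x_k | x_0) in closed form; k is 1-indexed, k = 0 returns x0."""
    if k == 0:
        return list(x0)
    a = alpha_bars[k - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


T = 100   # total number of diffusion steps (assumed value)
k = 54    # hypothetical shallow step, k < T
alpha_bars = cumulative_alphas(linear_beta_schedule(T))
rng = random.Random(0)

# Stand-in for the mel-spectrogram predicted by a simple decoder.
coarse_mel = [0.2, -0.5, 1.0, 0.0]

# Instead of starting the reverse process from pure noise at step T,
# a shallow scheme would start it from this partially noised sample at step k.
x_k = diffuse_to_step(coarse_mel, k, alpha_bars, rng)
```

Starting from step `k` rather than `T` leaves fewer reverse steps to run, and more of the decoder's prediction survives in `x_k` because ᾱ_k is larger than ᾱ_T. In the paper the step is located by a boundary prediction network rather than fixed by hand.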


“…Finally, since the pipeline of SVS resembles that of text-to-speech (TTS) task, we make adjustments to DiffSinger for generalization. The contributions of this work can be summarized as follows 5 In this section, we introduce the theory of diffusion probabilistic model [7,31]. The full proof can be found in previous works [7,13,32].…”

Section: Introduction (mentioning)

confidence: 99%


“…ML-GAN in HiFiSinger helps supervise waveform reconstruction and achieves good results in single speaker singing voice synthesis, but its F0 embedding reduces model generalizability in multi-speaker singing data. ByteSing (Gu et al 2020) is a Chinese SVS system based on duration allocated Tacotronlike acoustic model and WaveRNN vocoder. The authors report that those systems can generate natural singing voices.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most previous works focus on optimizing the acoustic model, but usually use speech vocoders for SVS (Gu et al 2020;Chen et al 2020). Some speech vocoders have been widely applied to SVS, such as WaveRNN in ByteSing (Gu et al 2020) and Parallel WaveGAN in HiFiSinger (Chen et al 2020). However, as an important component in SVS, the vocoder directly impacts the upper bound of generated audio quality.…”

Section: Introduction (mentioning)

confidence: 99%

SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation

Huang, Cui, Chen et al. 2021

Preprint


High-fidelity singing voice synthesis is challenging for neural vocoders due to extremely long continuous pronunciation, high sampling rate and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot directly be applied to singing voice synthesis because they result in glitches in the generated spectrogram and poor high-frequency reconstruction. To tackle the difficulty of singing modeling, in this paper, we propose SingGAN, a singing voice vocoder with generative adversarial network. Specifically, 1) SingGAN uses source excitation to alleviate the glitch problem in the spectrogram; and 2) SingGAN adopts multi-band discriminators and introduces frequency-domain loss and sub-band feature matching loss to supervise high-frequency reconstruction. To our knowledge, SingGAN is the first vocoder designed towards high-fidelity multi-speaker singing voice synthesis. Experimental results show that SingGAN synthesizes singing voices with much higher quality (0.41 MOS gains) over the previous method. Further experiments show that combined with FastSpeech 2 as an acoustic model, SingGAN achieves high robustness in the singing voice synthesis pipeline and also performs well in speech synthesis. Audio samples are available at https://SingGAN.github.io/.


“…Singing voice synthesis (SVS) aims to synthesize high-quality and expressive singing voices based on musical score information. Singing voice synthesis (SVS) systems [2,14,22] take music score and lyric information as input to generate singing voices, and these systems have been widely deployed in music softwares, music boxes, and so on. SVS systems could generate singing voices with comparable quality to reference songs, which attract widespread research interest.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

Huang, Chen, Ren et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia


High-fidelity multi-singer singing voice synthesis is challenging for neural vocoders due to the shortage of singing voice data, limited singer generalization, and large computational cost. Existing open corpora cannot meet the requirements of high-fidelity singing voice synthesis because of weaknesses in scale and quality. Previous vocoders have difficulty in multi-singer modeling, and a distinct degradation emerges when generating singing voices for unseen singers. To accelerate singing voice research in the community, we release a large-scale, multi-singer Chinese singing voice dataset, OpenSinger. To tackle the difficulty in unseen singer modeling, we propose Multi-Singer, a fast multi-singer vocoder with generative adversarial networks. Specifically, 1) Multi-Singer uses a multi-band generator to speed up both training and inference; 2) to capture and rebuild singer identity from the acoustic feature (i.e., mel-spectrogram), Multi-Singer adopts a singer conditional discriminator and a conditional adversarial training objective; 3) to supervise the reconstruction of singer identity in the spectrum envelopes in the frequency domain, we propose an auxiliary singer perceptual loss. The joint training approach works effectively in GANs for multi-singer voice modeling. Experimental results verify the effectiveness of OpenSinger and show that Multi-Singer improves unseen-singer singing voice modeling in both speed and quality over previous methods. A further experiment shows that, combined with FastSpeech 2 as the acoustic model, Multi-Singer achieves strong robustness in the multi-singer singing voice synthesis pipeline. Samples are available at https://Multi-Singer.github.io/
