ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
DOI: 10.1109/icassp43922.2022.9747664

VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis

Cited by 33 publications (7 citation statements)
References 10 publications
“…• FastSpeech2 (Ren et al 2020). Although it is a speech synthesis model, employing it for singing voice synthesis tasks can achieve good results (Zhuang et al 2021;Zhang et al 2022b). We use the same parameter settings as FT-GAN, with the encoder and the decoder being 4-layer 8-head Feed-Forward Transformers with a hidden size of 256.…”
Section: Experiments, Experiments Settings
Citation type: mentioning
confidence: 99%
“…• Visinger2 (Zhang et al 2022c) Visinger2 is modified from the speech synthesis model VITS (Kim et al 2021). It introduces DDSP (Engel et al 2020) in generating audio to improve performance.…”
Section: Experiments, Experiments Settings
Citation type: mentioning
confidence: 99%
“…Significant progress has been made in the field of Text-Music Generation through various methods, including the use of deep cross-modal correlation learning architectures that determine the similarity between temporal structures in audio and lyrics [5]. One research is VISinger [13], a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates audio waveform from lyrics and musical score. It adopts VAE-based posterior encoder augmented with normalizing flow-based prior encoder and adversarial decoder to realize complete end-to-end speech generation.…”
Section: Text-audio Generation
Citation type: mentioning
confidence: 99%
“…They proposed a shallow diffusion mechanism to improve audio quality and accelerate inference. Visinger (Zhang et al, 2022a) and Visinger2 (Zhang et al, 2022b) are fully end-to-end models, and the acoustic model is trained together with the vocoder. This type of model can avoid the problem of error accumulation.…”
Section: Case Study
Citation type: mentioning
confidence: 99%