“…The development of neural end-to-end text-to-speech (TTS) models [1,2,3,4,5,6] has greatly advanced speech synthesis. Generally, with a well-trained neural acoustic model [2,5,6,7] and a neural vocoder [8,9,10,11], or alternatively with fully end-to-end models [12,13,14] that construct waveforms directly from text input, it is possible to synthesize high-quality neutral speech. Recently, much attention has turned to synthesizing expressive speech, such as stylized speech [15,16], emotional speech [17,18,19,20,21,22], and singing voice [23,24].…”