11th ISCA Speech Synthesis Workshop (SSW 11) 2021
DOI: 10.21437/ssw.2021-17
Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Abstract: Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work [1], a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voice…

Cited by 14 publications (14 citation statements)
References 31 publications
“…Without the Universal Vocoder, it would not be possible to generate the raw audio signal for hundreds of speakers included in the LibriTTS corpus. Details of the S2S method are shown in the works of Shah et al [20] and Jiao et al [66]. The main difference between these two models and our S2S model is the use of the P2P mapping to introduce pronunciation errors.…”
Section: Methods
confidence: 99%
“…Huang et al [62] use a machine translation technique to generate text to train an ASR language model in a low-resource language. At the same time, Shah et al [20] and Huybrechts et al [19] employ S2S voice conversion to improve the quality of speech synthesis in the data reduction scenario.…”
Section: Synthetic Speech Generation For Pronunciation Error Detection
confidence: 99%
“…This paradigm greatly enhanced the naturalness and flexibility of speech synthesis. It enables new applications such as expressive [2,3] and low-resource [4] speech generation, as well as speaker identity [5] and prosody transplantation [6,7]. This paper focuses on expressive speech synthesis, i.e.…”
Section: Introduction
confidence: 99%
“…In contrast to the frame-by-frame prediction of the Mel spectrogram, non-autoregressive models generate the Mel spectrogram in parallel, thus avoiding error propagation through previously predicted frames. There has been limited research on expressivity in non-autoregressive TTS [11,12]. Moreover, in [13] the authors proposed to leverage a normalizing-flow approach conditioned on the speaker identity.…”
Section: Introduction
confidence: 99%
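The distinction drawn in the excerpt above — frame-by-frame autoregressive decoding versus parallel generation driven by explicit durations — can be illustrated with a minimal toy sketch. This is not the paper's model; the function names (`length_regulate`, `ar_decode`) and the scalar "frame" representation are illustrative assumptions only. The key point is that once per-phoneme durations are predicted explicitly, every output frame's input is known up front, so all frames can be generated in one parallel pass instead of a sequential loop.

```python
# Toy sketch (illustrative only, not the paper's architecture): contrasting
# explicit-duration non-autoregressive generation with autoregressive decoding.

def length_regulate(phoneme_states, durations):
    """Upsample each phoneme state by its predicted duration (in frames).

    With durations known in advance, the input to every Mel frame is fixed,
    so a non-autoregressive decoder can predict all frames in parallel.
    """
    frames = []
    for state, dur in zip(phoneme_states, durations):
        frames.extend([state] * dur)  # repeat the state for `dur` frames
    return frames


def ar_decode(n_frames, step):
    """Autoregressive baseline: each frame is conditioned on the previous
    prediction, forcing sequential generation and letting errors propagate."""
    out, prev = [], 0.0
    for _ in range(n_frames):
        prev = step(prev)  # next frame depends on the last predicted frame
        out.append(prev)
    return out


if __name__ == "__main__":
    # Hypothetical phoneme encodings and predicted durations (frames each).
    states = ["h", "e", "l"]
    durations = [2, 3, 1]
    print(length_regulate(states, durations))  # 6 frame inputs, available at once
    print(ar_decode(4, lambda p: p + 1.0))     # 4 frames, produced one at a time
```

In a real model the repeated states would feed a parallel decoder (e.g. a Transformer without causal masking); the sketch only shows why explicit durations remove the sequential dependency.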