Interspeech 2018
DOI: 10.21437/interspeech.2018-1506

Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions

Abstract: This paper introduces an improved generative model for statistical parametric speech synthesis (SPSS) based on WaveNet under a multi-task learning framework. Unlike the original WaveNet model, the proposed Multi-task WaveNet employs frame-level acoustic feature prediction as a secondary task, so the external fundamental frequency prediction model required by the original WaveNet can be removed. The improved WaveNet can therefore generate high-quality speech waveforms conditioned only on linguistic features…
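The abstract describes a primary waveform-generation task combined with a secondary frame-level acoustic-feature prediction task. Below is a minimal sketch (PyTorch) of that multi-task idea: a sample-level softmax loss plus a weighted frame-level regression loss. The module names, layer sizes, and the weight `alpha` are illustrative assumptions, not the paper's actual architecture or configuration; the weighted secondary loss simply injects frame-level supervision so no external F0 predictor is needed as a condition.

```python
# Sketch only: a shared conditioning network over linguistic features feeds
# two heads, one for the next quantized waveform sample (primary task) and
# one for frame-level acoustic features such as mel-cepstra and F0
# (secondary task). Frame/sample timing is conflated here for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskWaveNetSketch(nn.Module):
    def __init__(self, n_linguistic=300, n_acoustic=60, channels=64, n_quant=256):
        super().__init__()
        # Shared conditioning stack over frame-level linguistic features.
        self.cond = nn.Sequential(
            nn.Linear(n_linguistic, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.ReLU(),
        )
        # Primary head: categorical distribution over 8-bit (mu-law) samples.
        # A real WaveNet uses dilated causal convolutions; a linear head
        # stands in for that stack to keep the sketch short.
        self.wave_head = nn.Linear(channels, n_quant)
        # Secondary head: frame-level acoustic feature regression.
        self.acoustic_head = nn.Linear(channels, n_acoustic)

    def forward(self, linguistic):  # linguistic: (batch, frames, n_linguistic)
        h = self.cond(linguistic)
        return self.wave_head(h), self.acoustic_head(h)

def multitask_loss(wave_logits, wave_targets, acoustic_pred, acoustic_targets,
                   alpha=0.25):
    """Primary waveform cross-entropy plus weighted secondary MSE loss."""
    ce = F.cross_entropy(wave_logits.transpose(1, 2), wave_targets)
    mse = F.mse_loss(acoustic_pred, acoustic_targets)
    return ce + alpha * mse
```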

Cited by 15 publications (3 citation statements)
References 19 publications

“…Audio DeepFakes. AI-based impersonation is not limited to imagery; recent AI-synthesized content generation is leading to the creation of highly realistic audio [28,29]. Using synthesized audio of the impersonated target can make DeepFake videos significantly more convincing and compounds their negative impact.…”
Section: Future Directions (citation type: mentioning)
confidence: 99%
“…Although objective measures do not directly correlate with subjective measures of human perception, they provide the means to assess the overall model performance (see, e.g. [21,22]). The objective measures used in the current setup are the root-mean-squared error (RMSE) and Pearson correlation between the reference and the synthesized signal in terms of (i) the f0 over the voiced intervals, (ii) the voiced energy, (iii) phone duration, and (iv) word duration.…”
Section: Objective Evaluation (citation type: mentioning)
confidence: 99%
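The excerpt above names RMSE and Pearson correlation between reference and synthesized parameter tracks as its objective measures. The following is a small sketch of how such measures are typically computed; the voicing mask and the example f0 arrays are illustrative assumptions, not data from the cited evaluation.

```python
# Sketch: RMSE and Pearson correlation between a reference and a synthesized
# parameter track (e.g. f0), restricted to frames where both are voiced.
import numpy as np

def rmse(reference, synthesized):
    reference = np.asarray(reference, dtype=float)
    synthesized = np.asarray(synthesized, dtype=float)
    return float(np.sqrt(np.mean((reference - synthesized) ** 2)))

def pearson(reference, synthesized):
    # Off-diagonal entry of the 2x2 correlation matrix.
    return float(np.corrcoef(reference, synthesized)[0, 1])

# Example: compare f0 only over voiced frames (f0 > 0 in both tracks).
ref_f0 = np.array([0.0, 120.0, 125.0, 0.0, 130.0])
syn_f0 = np.array([0.0, 118.0, 127.0, 0.0, 126.0])
voiced = (ref_f0 > 0) & (syn_f0 > 0)
print(rmse(ref_f0[voiced], syn_f0[voiced]),
      pearson(ref_f0[voiced], syn_f0[voiced]))
```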
“…In today's era, advances in Artificial Intelligence and Deep Neural Networks have produced very significant results in creating more realistic synthesized audio and speech [2], [4]. Speech cloning and duplication, achieved by training neural networks with powerful AI algorithms, yield such synthesized speech.…”
Section: Introduction (citation type: mentioning)
confidence: 99%