2018
DOI: 10.48550/arxiv.1803.09047
Preprint
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Cited by 41 publications (66 citation statements)
References 0 publications
“…Since the speaker encoding network operates on waveforms, it can be used for zero-shot voice cloning from untranscribed utterances of a target speaker. Additionally, the authors of [1] demonstrate that the synthesis model can be fine-tuned on limited text and audio pairs of a new speaker to improve speaker similarity. Expressive Speech Synthesis: Prior works [31,25,24] on expressive speech synthesis focus on models that can be conditioned on text and a latent embedding for style or prosody. During training, the style embeddings are derived using a learnable module called Global Style Tokens (GST), which operates on the target speech for a given phrase and derives a style embedding through attention over a dictionary of learnable vectors.…”
Section: Background and Related Work
confidence: 99%
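The GST mechanism described above — an attention query derived from the reference speech, softmaxed over a dictionary of learnable token vectors — can be sketched minimally as follows. All dimensions, weight matrices, and the dot-product scoring are illustrative assumptions, not the paper's actual architecture or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 10 style tokens of dimension 256,
# a 128-dim reference encoding projected into the token space.
num_tokens, token_dim, ref_dim = 10, 256, 128

style_tokens = rng.standard_normal((num_tokens, token_dim))  # learnable token dictionary
proj = rng.standard_normal((ref_dim, token_dim))             # learnable query projection

def style_embedding(ref_encoding):
    """Derive a style embedding by attending over the style-token bank."""
    query = ref_encoding @ proj                         # (token_dim,)
    scores = style_tokens @ query / np.sqrt(token_dim)  # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax attention weights
    return weights @ style_tokens                       # weighted sum of tokens

ref = rng.standard_normal(ref_dim)  # stand-in for a reference-speech encoding
emb = style_embedding(ref)
print(emb.shape)  # (256,)
```

At inference time the same embedding can be produced without reference audio by choosing the attention weights directly, which is what makes the token dictionary useful for style control.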
“…Several past works have focused on the problem of expressive TTS synthesis by learning latent variables for controlling the style aspects of speech synthesized for a given text [31,24]. Such models are usually trained on a single-speaker expressive speech dataset to learn meaningful latent codes for various style aspects of the speech.…”
Section: Introduction
confidence: 99%
“…In [6,7], style tokens were used to model prosody explicitly. Prosody can also be enriched during prosody transfer, as in [8,9,10,11]: the prosody attributes of an entire utterance or segment are extracted into a single latent variable from a reference utterance or segment, which is then used to control the prosody of the synthesized speech.…”
Section: Introduction
confidence: 99%
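The single-latent prosody transfer described above — one vector summarizing a whole reference utterance — can be sketched as pooling the outputs of a reference encoder over time. The shapes, the random weight matrix, and mean-pooling are illustrative assumptions standing in for a trained encoder, not any cited system's actual design:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: a reference utterance as T frames of 80-dim mel features,
# summarized into one 16-dim prosody latent (dimensions are illustrative only).
T, mel_dim, latent_dim = 120, 80, 16
encoder_w = rng.standard_normal((mel_dim, latent_dim))  # stands in for a trained reference encoder

def prosody_latent(reference_mel):
    """Summarize an entire reference utterance as a single latent vector."""
    frame_latents = np.tanh(reference_mel @ encoder_w)  # (T, latent_dim) per-frame codes
    return frame_latents.mean(axis=0)                   # one fixed-size vector per utterance

reference_mel = rng.standard_normal((T, mel_dim))
z = prosody_latent(reference_mel)
print(z.shape)  # (16,)
```

Because the latent is a single fixed-size vector, it captures utterance-level prosody (e.g. overall pitch and tempo) but cannot encode fine-grained, frame-level variation.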
“…To address this problem, multi-task speech synthesis methods based on reference-audio feature embeddings have been proposed [69,84,82,86,4], which can synthesize speech with a specified text, emotion, and speaker identity. However, almost all of these methods require reference audio to synthesize the target speech.…”
Section: Introduction
confidence: 99%