2021
DOI: 10.48550/arXiv.2110.06306
Preprint

Fine-grained style control in Transformer-based Text-to-speech Synthesis

Abstract: In this paper, we present a novel architecture to realize fine-grained style control on the transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention bl…
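The abstract sketches the core mechanism: the content encoder is replaced by cross-attention blocks in which content queries attend over the local style tokens, and a skip connection lets style be fused in gradually. Below is a minimal PyTorch sketch of one such block; the module name, dimensions, and layer-norm placement are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class StyleFusionBlock(nn.Module):
    """One cross-attention fusion block (illustrative): content queries
    attend over the local-style-token (LST) sequence, and a skip
    connection adds the attended style back onto the content
    representation so style is blended in gradually."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (batch, T_text, d_model)  phoneme/content embeddings
        # style:   (batch, T_style, d_model) LST sequence from the reference speech
        attended, _ = self.cross_attn(query=content, key=style, value=style)
        return self.norm(content + attended)  # skip connection, then layer norm

# Dummy usage: several such blocks would replace the original content encoder.
block = StyleFusionBlock()
content = torch.randn(2, 50, 256)   # batch of 2, 50 phoneme embeddings
style = torch.randn(2, 120, 256)    # batch of 2, 120 local style tokens
fused = block(content, style)       # -> (2, 50, 256)
```

Performing the fusion as a residual update (content plus attended style) is what gives the inductive bias the abstract mentions: each block nudges the content representation toward the style rather than overwriting it.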

Cited by 2 publications (2 citation statements) | References 17 publications
“…[13] shows that discretizing phoneme-level latent features and using an autoregressive prior generates more natural samples than simply sampling from a standard VAE prior. Extending the idea of GSTs, a pretrained wav2vec 2.0 [14] model can be used to capture local style patterns in a transformer-based architecture [15]. In [16], prosody is controlled by incorporating a word-level GST module in a non-attentive Tacotron model [17], with the addition of an autoregressive prior which allows high-quality speech synthesis without requiring a reference audio.…”
Section: Related Work
confidence: 99%
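The statement above mentions using a pretrained wav2vec 2.0 model to capture local style patterns. A hedged sketch of that feature-extraction step, assuming the Hugging Face transformers API; the checkpoint name and the downstream projection to style tokens are illustrative assumptions, not the cited paper's pipeline.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Hypothetical checkpoint choice; any wav2vec 2.0 checkpoint would do here.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(16000)  # 1 s of dummy reference audio at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # (1, T_frames, 768): one feature vector per ~20 ms frame
    frame_features = model(**inputs).last_hidden_state
# These frame-level features would then be projected/quantized into the
# time sequence of local style tokens consumed by the cross-attention blocks.
```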
“…4) FG-TransformerTTS (Chen & Rudnicky, 2021): the fine-grained style control on the autoregressive Transformer-TTS model. 5) Expressive FastSpeech 2 (Ren et al., 2020): the combination of both multi-speaker (Chen et al., 2020b) and multi-emotion (Cui et al., 2021) FastSpeech 2, which adds the speaker and emotion d-vectors extracted by the pretrained discriminative models to the backbone.…”
confidence: 99%
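The Expressive FastSpeech 2 baseline described above conditions the backbone by adding speaker and emotion d-vectors from pretrained discriminative models. A minimal sketch of one plausible way to do this, assuming PyTorch; the dimensions and the injection point (the encoder output) are assumptions, not the cited implementation.

```python
import torch
import torch.nn as nn

class DVectorConditioning(nn.Module):
    """Projects speaker and emotion d-vectors into the model dimension and
    adds them (broadcast over time) to the backbone's encoder output."""
    def __init__(self, d_model: int = 256, spk_dim: int = 192, emo_dim: int = 192):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, d_model)
        self.emo_proj = nn.Linear(emo_dim, d_model)

    def forward(self, encoder_out, spk_dvec, emo_dvec):
        # encoder_out: (B, T, d_model); spk_dvec: (B, spk_dim); emo_dvec: (B, emo_dim)
        cond = self.spk_proj(spk_dvec) + self.emo_proj(emo_dvec)
        return encoder_out + cond.unsqueeze(1)  # broadcast the condition over time
```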