Towards Multi-Scale Style Control for Expressive Speech Synthesis
Preprint, 2021
DOI: 10.48550/arxiv.2104.03521

Cited by 4 publications (8 citation statements); references 11 publications.
“…In this subsection, we review techniques on disentangling [218,117,275], controlling [353,180,240,13,267,343,192], and transferring [150,131,393,6] variation information, as shown in Table 15 [303,377,139,436,39]. Disentangling with Adversarial Training: when multiple styles or prosody information are entangled together, it is necessary to disentangle them during training for better expressive speech synthesis and control.…”
Section: Disentangling, Controlling and Transferring
confidence: 99%
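One common realization of the adversarial disentangling mentioned in the snippet above is a gradient reversal layer (GRL): identity in the forward pass, negated gradient in the backward pass, so the style encoder is pushed to discard the information an adversary (e.g. a speaker classifier) can predict. The sketch below is an illustrative assumption, not the setup of any specific cited work; the scaling factor `lam` and the toy gradient are hypothetical.

```python
import numpy as np

def grl_forward(x):
    """Forward pass of a gradient reversal layer: the identity."""
    return x

def grl_backward(grad_from_adversary, lam=1.0):
    """Backward pass: negate (and scale) the adversary's gradient,
    so the upstream encoder learns to *remove* that information."""
    return -lam * grad_from_adversary

# Toy check: the gradient reaching the encoder is the scaled negation
# of the gradient the adversarial classifier sends down.
g = np.array([0.5, -2.0, 1.0])
g_enc = grl_backward(g, lam=0.5)
```

In a full model this sits between the style encoder and the adversarial classifier; everything else trains normally.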
“…Recently, some efforts have been made to model style at multiple scales or hierarchically [25], [41], [42]. In [41], the authors extended GST to a hierarchical GST architecture, in which several GST layers are stacked with residual connections, to learn hierarchical embedding information implicitly.…”
Section: B. Reference Audio Based Expressive Speech Synthesis
confidence: 99%
“…In their model, the first GST layer performs well at speaker discrimination, while the representations from deeper GST layers tend to capture finer speaking styles or emotion variations. Unlike this implicit approach, in [25] a reference encoder-based model is trained explicitly to extract phoneme-level and global-level style features from mel-spectrograms. In this paper, the proposed method also learns information at different scales explicitly.…”
Section: B. Reference Audio Based Expressive Speech Synthesis
confidence: 99%
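The hierarchical GST described in the snippets above can be sketched as stacked style-token attention layers joined by residual connections: each layer attends over its own bank of learned style tokens, using the (residually updated) reference embedding as the query. This is a minimal NumPy sketch under assumed shapes (dimension 8, 10 tokens per bank, 3 layers); the real models use learned multi-head attention over trained token banks.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gst_layer(ref_embedding, tokens):
    """One GST layer: attend over a bank of style tokens, with the
    reference embedding as query, and return the weighted sum."""
    scores = softmax(ref_embedding @ tokens.T)  # (1, n_tokens)
    return scores @ tokens                      # (1, d)

def hierarchical_gst(ref_embedding, token_banks):
    """Stack GST layers with residual connections, so deeper layers
    can refine the style representation of earlier ones."""
    h = ref_embedding
    for tokens in token_banks:
        h = h + gst_layer(h, tokens)  # residual connection
    return h

rng = np.random.default_rng(0)
d, n_tokens, n_layers = 8, 10, 3
banks = [rng.standard_normal((n_tokens, d)) for _ in range(n_layers)]
ref = rng.standard_normal((1, d))   # stand-in for a reference-encoder output
style = hierarchical_gst(ref, banks)
```

The resulting `style` vector would condition the TTS decoder; in the cited hierarchical design, shallower layers tend toward speaker identity and deeper ones toward finer style variation.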
“…To model and control local prosodic variations in speech, some previous works attempt to predict finer-grained speaking styles from text, such as at the word level [15] or phoneme level [16]. It is widely accepted that the style expressions of human speech are multi-scale in nature [17,18]: the global-scale style is usually observed as emotion, while the local scale is closer to prosodic variation [19,20]. Styles at these different levels work together to produce rich expressiveness in speech.…”
Section: Introduction
confidence: 99%