2019
DOI: 10.1109/lsp.2019.2931673
An Effective Style Token Weight Control Technique for End-to-End Emotional Speech Synthesis

Cited by 41 publications (27 citation statements)
References 11 publications
“…Therefore, we manually specify the weight of style tokens by averaging the style token weights of an audio set that belongs to a certain kind of emotion. The audio set can be directly specified as all the utterances with the same emotion label, as used in [7], when the TTS dataset has emotion labels. However, our TTS dataset has no ground-truth emotion labels, and only the soft labels predicted from the cross-domain SER model are available.…”
Section: Choice For Reference Audio Set Of Each Emotion Class (mentioning, confidence: 99%)
“…Some other studies use the global style tokens (GST) [6] framework to model the emotional features. [7] proposes an effective style token weights control scheme that uses the centroid of weight vectors of each emotion cluster to generate speech of the emotion. [8] is also a GST-based method for emotional TTS, where the authors propose an inter-to-intra distance ratio algorithm that well considers the distribution of emotions to determine the emotion weights.…”
Section: Introduction (mentioning, confidence: 99%)
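The weight-control scheme described in the statement above (averaging the style token weights of the utterances in each emotion cluster, then using that centroid at synthesis time) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, array shapes, and the idea that per-utterance GST attention weights have already been extracted from a trained GST-Tacotron are all hypothetical.

```python
# Minimal sketch: per-emotion centroid of GST attention weights.
# Assumes each training utterance's style token weight vector (e.g. 10 tokens,
# as in the GST paper) has already been extracted; names/shapes are illustrative.
from collections import defaultdict
import numpy as np

def emotion_centroids(token_weights, emotion_labels):
    """token_weights: (N, num_tokens) array of per-utterance GST weights.
    emotion_labels: length-N sequence of emotion labels (e.g. 'happy', 'sad').
    Returns a dict mapping each emotion to its mean (centroid) weight vector."""
    groups = defaultdict(list)
    for w, label in zip(token_weights, emotion_labels):
        groups[label].append(w)
    return {label: np.mean(np.stack(ws), axis=0) for label, ws in groups.items()}

# Usage (hypothetical): at synthesis time the centroid for the target emotion
# replaces the reference-encoder output, so no reference audio is needed:
#   centroids = emotion_centroids(train_weights, train_labels)
#   style_embedding = centroids['happy'] @ style_token_embeddings  # (embed_dim,)
```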
“…[3] introduced an end-to-end (E2E) emotional speech synthesizer based on Tacotron [8] by injecting a learned emotion embedding. [4,5] adopted pretrained global style tokens (GSTs [9]) to represent different emotions. Some systems also considered synthesizing emotion at different strength levels [10,11].…”
Section: Single Speaker ESS (mentioning, confidence: 99%)
“…The decoder takes this regulated encoding sequence as input to predict the mel spectrogram, conditioned on speaker and emotion embeddings. We investigate two approaches [3,5] to emotion embedding, either as a free, learnable vector which we denote as BASE-EMB, or as weighted combination of GSTs which we denote as BASE-GST.…”
Section: Baseline (mentioning, confidence: 99%)
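The two baseline emotion-embedding choices mentioned in the statement above could look roughly like the sketch below. The module names, dimensions, and the 10-token GST bank are assumptions for illustration, not the citing authors' implementation.

```python
# Sketch of the two emotion-embedding variants described above (PyTorch).
import torch
import torch.nn as nn

class BaseEmb(nn.Module):
    """BASE-EMB: one free, learnable vector per emotion class."""
    def __init__(self, num_emotions=4, embed_dim=256):
        super().__init__()
        self.table = nn.Embedding(num_emotions, embed_dim)

    def forward(self, emotion_id):          # emotion_id: (batch,) LongTensor
        return self.table(emotion_id)       # (batch, embed_dim)

class BaseGST(nn.Module):
    """BASE-GST: emotion embedding as a weighted combination of global style tokens."""
    def __init__(self, num_emotions=4, num_tokens=10, embed_dim=256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, embed_dim))     # GST bank
        self.logits = nn.Parameter(torch.zeros(num_emotions, num_tokens))  # per-emotion weights

    def forward(self, emotion_id):
        weights = torch.softmax(self.logits[emotion_id], dim=-1)  # (batch, num_tokens)
        return weights @ self.tokens                              # (batch, embed_dim)
```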