2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
DOI: 10.1109/apsipaasc47483.2019.9023186

End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training

Abstract: This paper proposes an end-to-end emotional speech synthesis (ESS) method which adopts global style tokens (GSTs) for semi-supervised training. The model is built on the GST-Tacotron framework, and the style tokens are defined to represent emotion categories. A cross-entropy loss between the token weights and the emotion labels is designed to make the style tokens interpretable, utilizing the small portion of the training data that carries emotion labels. Emotion recognition experiments confirm that this method c…
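A minimal sketch of the semi-supervised loss described in the abstract, assuming the GST layer exposes its softmax attention weights over the tokens and that unlabelled utterances carry a sentinel label of -1. All names here (semi_supervised_loss, token_weights, ce_weight) are illustrative, not from the paper.

    # Hypothetical sketch: style tokens double as emotion categories, and a
    # cross-entropy term ties the token attention weights to emotion labels
    # on the labelled subset of the training data.
    import torch
    import torch.nn.functional as F

    def semi_supervised_loss(mel_pred, mel_target, token_weights,
                             emotion_labels, ce_weight=1.0):
        # mel_pred, mel_target: (batch, frames, n_mels) Tacotron output/target.
        # token_weights: (batch, num_tokens) softmax weights over style tokens.
        # emotion_labels: (batch,) emotion index, or -1 for unlabelled items.
        recon = F.l1_loss(mel_pred, mel_target)      # usual Tacotron term
        labelled = emotion_labels >= 0               # labelled subset only
        if labelled.any():
            # Cross entropy between token weights and emotion labels makes
            # each token interpretable as one emotion category.
            log_w = torch.log(token_weights[labelled].clamp_min(1e-8))
            ce = F.nll_loss(log_w, emotion_labels[labelled])
        else:
            ce = mel_pred.new_zeros(())
        return recon + ce_weight * ce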


Cited by 53 publications (37 citation statements). References 12 publications (16 reference statements).
“…For synthetic speech quality, we conducted objective experiments. We used the root mean square error of F0 (F0 RMSE) and mel-cepstrum distortion (MCD) as objective metrics, as in [22]. The FastDTW [23] algorithm was adopted to align the predicted acoustic feature sequences with the natural ones.…”
Section: Accuracy of Acoustic Feature Prediction
confidence: 99%
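A minimal sketch of the objective evaluation quoted above, assuming the mel-cepstra and F0 tracks have already been extracted (e.g. by a WORLD-style vocoder analysis) and using the fastdtw Python package for alignment; the function and variable names are illustrative, not from the cited papers.

    # Hypothetical sketch: align predicted and natural acoustic features
    # with FastDTW, then compute MCD and F0 RMSE on the aligned frames.
    import numpy as np
    from fastdtw import fastdtw
    from scipy.spatial.distance import euclidean

    def aligned_metrics(mcep_pred, mcep_ref, f0_pred, f0_ref):
        # Align the frame sequences on the mel-cepstra; reuse the path for F0.
        _, path = fastdtw(mcep_pred, mcep_ref, dist=euclidean)
        idx_p, idx_r = zip(*path)
        mp, mr = mcep_pred[list(idx_p)], mcep_ref[list(idx_r)]
        fp, fr = f0_pred[list(idx_p)], f0_ref[list(idx_r)]

        # MCD in dB: standard constant 10*sqrt(2)/ln(10) over the cepstral
        # dimensions (0th coefficient assumed excluded by the caller).
        mcd = (10.0 * np.sqrt(2.0) / np.log(10.0)) * np.mean(
            np.sqrt(np.sum((mp - mr) ** 2, axis=1)))

        # F0 RMSE over frames where both signals are voiced (F0 > 0).
        voiced = (fp > 0) & (fr > 0)
        f0_rmse = np.sqrt(np.mean((fp[voiced] - fr[voiced]) ** 2))
        return mcd, f0_rmse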
“…There have been studies that leverage emotional speech modeling for expressive TTS [33], [68], [86]–[88]. Eyben et al. [68] incorporate unsupervised expression cluster information into an HMM-based TTS system.…”
Section: Deep Features for Perceptual Loss
confidence: 99%
“…Finally, a weighted sum of the GSTs is used as the style embedding. Numerous GST-Tacotron variants [11, 20] have been proposed to improve model performance. However, most of them are no longer unsupervised methods.…”
Section: Related Work
confidence: 99%
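A minimal sketch of the weighted-sum mechanism this excerpt describes, reduced to single-head attention for brevity (the original GST-Tacotron uses multi-head attention over the token bank); the class name, dimensions, and defaults are illustrative assumptions.

    # Hypothetical sketch: a reference-encoder query attends over a bank of
    # learned style tokens; the style embedding is the weighted sum of tokens.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalStyleTokens(nn.Module):
        def __init__(self, num_tokens=10, token_dim=256, query_dim=128):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
            self.query_proj = nn.Linear(query_dim, token_dim)

        def forward(self, ref_embedding):
            # ref_embedding: (batch, query_dim) from the reference encoder.
            q = self.query_proj(ref_embedding)              # (B, D)
            keys = torch.tanh(self.tokens)                  # (T, D)
            scores = q @ keys.T                             # (B, T)
            weights = F.softmax(scores, dim=-1)             # token weights
            style = weights @ keys                          # weighted sum
            return style, weights

Returning the weights alongside the style embedding is what makes the semi-supervised variant above possible: the same attention weights can be supervised with emotion labels where they exist.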