2021
DOI: 10.1109/taslp.2021.3074757
Extracting and Predicting Word-Level Style Variations for Speech Synthesis

Cited by 11 publications (5 citation statements) · References 35 publications
“…WSV* Word-level style variations (WSV) model. For a fair comparison, instead of Tacotron2 [1] used in the original version of WSV [15], FastSpeech 2 was adopted as the backbone in our implementation. In addition, an extra bidirectional GRU is used to consider the context information.…”
Section: Compared Methods (mentioning)
confidence: 99%
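The excerpt above notes that the cited implementation adds a bidirectional GRU so each word-level representation sees both left and right context. The following is only an illustrative NumPy sketch of that idea, with randomly initialized weights and hypothetical function names (`gru_step`, `bidirectional_gru`); it is not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state n."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)
    n = np.tanh(p["Wn"] @ x + p["Un"] @ (r * h))
    return (1.0 - z) * n + z * h

def init_params(in_dim, hid_dim, rng):
    """Random weights for illustration; W* act on inputs, U* on hidden states."""
    s = 1.0 / np.sqrt(hid_dim)
    return {k: rng.uniform(-s, s, (hid_dim, in_dim if k.startswith("W") else hid_dim))
            for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}

def bidirectional_gru(xs, in_dim, hid_dim, seed=0):
    """Run a forward and a backward GRU over a sequence of word vectors and
    concatenate the two hidden states at each position, so every word's
    representation is conditioned on both its left and right context."""
    rng = np.random.default_rng(seed)
    fwd_p, bwd_p = init_params(in_dim, hid_dim, rng), init_params(in_dim, hid_dim, rng)
    h, fwd = np.zeros(hid_dim), []
    for x in xs:                      # left-to-right pass
        h = gru_step(x, h, fwd_p)
        fwd.append(h)
    h, bwd = np.zeros(hid_dim), []
    for x in reversed(xs):            # right-to-left pass
        h = gru_step(x, h, bwd_p)
        bwd.append(h)
    bwd.reverse()
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```

For a sequence of T word vectors the output has shape (T, 2 * hid_dim), one context-aware vector per word.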
“…The root mean square error (RMSE) of F0 and energy, and the MSE of duration are adopted as the metrics of objective evaluation following [3,15]. To calculate the RMSE of F0 and energy, we first apply the dynamic time warping (DTW) to construct the alignment paths between the predicted mel-spectrogram and the ground-truth one.…”
Section: Objective Evaluation (mentioning)
confidence: 99%
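The excerpt above describes computing RMSE of F0 and energy after dynamic time warping aligns the predicted and ground-truth sequences. Below is a minimal 1-D sketch of that procedure; `dtw_path` and `aligned_rmse` are illustrative names, and the actual evaluation in the cited work aligns mel-spectrograms before reading F0/energy values along the path.

```python
import numpy as np

def dtw_path(x, y):
    """Classic DTW between two 1-D sequences; returns the alignment path
    as a list of (i, j) index pairs."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],   # match
                                 cost[i - 1, j],       # insertion
                                 cost[i, j - 1])       # deletion
    # Backtrack from (n, m) to recover the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def aligned_rmse(pred, ref):
    """RMSE computed over DTW-aligned frame pairs rather than raw indices,
    so sequences of different lengths or timing can still be compared."""
    path = dtw_path(pred, ref)
    diffs = np.array([pred[i] - ref[j] for i, j in path])
    return float(np.sqrt(np.mean(diffs ** 2)))
```

Aligning first matters because predicted and ground-truth utterances rarely share identical durations, so a frame-by-frame RMSE without warping would penalize timing differences rather than F0 or energy errors.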
“…Luz's improved system could generate speech with richer prosody at the inference stage, even with limited training data [15]. Zhang and Ling proposed a speech synthesis model based on fine-grained style representations, called word-level style variation (WSV) [16]. To improve the accuracy of WSV prediction and the naturalness of the synthesized speech, they used a pretrained BERT model together with speech information to derive semantic descriptions.…”
Section: TTS (mentioning)
confidence: 99%
“…Furthermore, to analyze the explainability of MFN, the dynamic fusion graph (DFG) model is embedded into MFN; the resulting Graph-MFN achieves excellent performance and is explainable [10]. Recently, word-level fusion representations have also attracted wide attention [23]. For example, the Recurrent Attended Variation Embedding Network (RAVEN) models multimodal language through word representation shift based on facial expressions [24].…”
Section: Related Work (mentioning)
confidence: 99%