ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9413604

Prosodic Clustering for Phoneme-Level Prosody Control in End-to-End Speech Synthesis

Abstract: This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the pho…

Cited by 11 publications (11 citation statements)
References 17 publications
“…For the duration feature, clustering is performed separately per phoneme as phoneme classes, such as vowels and consonants, differ substantially depending on their articulation characteristics. Results from our previous work [46] show that voice quality deterioration when using the outermost clusters is not so severe in F0 control compared to duration control. Thus, we have adopted a balanced clustering method for extracting duration clusters.…”
Section: Prosodic Clustering (mentioning)
confidence: 85%
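
The per-phoneme, balanced duration clustering described in the statement above can be illustrated with a minimal sketch. It assumes that "balanced" means equal-frequency bins (each cluster receives roughly the same number of samples); the cited work may define it differently. The function names `balanced_labels` and `duration_labels_per_phoneme`, the phoneme symbols, and the duration values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def balanced_labels(values: Sequence[float], k: int) -> List[int]:
    """Assign equal-frequency labels 0..k-1 by rank (assumed meaning of 'balanced').

    Equal values can fall into neighbouring bins, because the split is by rank.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels


def duration_labels_per_phoneme(
    phonemes: Sequence[str], durations: Sequence[float], k: int
) -> List[int]:
    """Cluster durations separately for each phoneme identity."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, ph in enumerate(phonemes):
        groups[ph].append(idx)

    out = [0] * len(phonemes)
    for idxs in groups.values():
        labs = balanced_labels([durations[i] for i in idxs], k)
        for i, lab in zip(idxs, labs):
            out[i] = lab
    return out


if __name__ == "__main__":
    # Hypothetical phoneme sequence with durations in seconds.
    phonemes = ["AA", "S", "AA", "S", "AA", "S"]
    durations = [0.12, 0.05, 0.20, 0.09, 0.16, 0.07]
    print(duration_labels_per_phoneme(phonemes, durations, k=3))
```

Grouping by phoneme first means that a label such as 0 denotes "short for this phoneme" rather than "short overall", which matches the stated motivation that vowels and consonants differ substantially in duration.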
“…Instead of training a quantized fine-grained VAE in order to learn latent representations, we simply use extracted features such as F0 and duration, the values of which are determined by standard speech processing tools. The discretization is then performed at the phoneme level using simple clustering methods, such as K-Means clustering, resulting in humanly interpretable labels which are directly applied to the dataset without requiring training [46]. An additional group of encoder and attention modules learn to model the discrete sequences and disentangle their content from the corresponding phoneme-level linguistic features.…”
Section: Proposed Methods (mentioning)
confidence: 99%
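
The discretization step described in the statement above can be sketched as a one-dimensional K-Means over extracted phoneme-level F0 values. This is an illustrative sketch only, not the authors' implementation: the function name `kmeans_1d`, the deterministic initialization, and the sample F0 values in Hz are all assumptions.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(
    values: Sequence[float], k: int, iters: int = 50
) -> Tuple[List[float], List[int]]:
    """Lloyd's algorithm on scalar values.

    Returns the centroids in ascending order and one label per value,
    so that label 0 is the lowest cluster and k-1 the highest.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Assumed deterministic init: spread centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    for _ in range(iters):
        buckets: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            buckets[nearest].append(v)
        # Keep the old centroid if a cluster received no values.
        updated = [
            sum(b) / len(b) if b else centroids[i] for i, b in enumerate(buckets)
        ]
        if updated == centroids:
            break
        centroids = updated

    centroids = sorted(centroids)
    labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
    return centroids, labels


if __name__ == "__main__":
    # Hypothetical mean F0 per phoneme, in Hz.
    f0 = [100, 102, 98, 150, 152, 148, 200, 205, 195]
    centroids, labels = kmeans_1d(f0, k=3)
    print(centroids, labels)
```

Because the centroids are sorted, the resulting labels are ordinal (low, mid, high pitch), which is one way to obtain the interpretable labels the statement mentions.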
“…They disentangle these prosody features and provide more independent control. [9,10] model these features with clustering, which is a purely data-driven method that have more interpretability. In contrast to explicit representation, implicit prosody representation is more complete and richer when modelling prosody diversity, yet uninterpretable.…”
Section: Introduction (mentioning)
confidence: 99%
“…In order to improve the generated prosody, several variational [7,8] and non-variational [9,10] methods have been proposed to learn latent prosodic representations. Some methods [11,12] are proposed for low-level prosody control. In inference, though some prosody attributes like emotions can be captured and transferred through reference audios, other prosody attributes related to the context could be inappropriate.…”
Section: Introduction (mentioning)
confidence: 99%