2018
DOI: 10.1007/978-3-030-00810-9_3

Phone-Level Embeddings for Unit Selection Speech Synthesis

Abstract: Deep neural networks have become the state of the art in speech synthesis. They have been used to directly predict signal parameters or to provide unsupervised speech segment descriptions through embeddings. In this paper, we present four models, two of which enable us to extract phone-level embeddings for unit selection speech synthesis. Three of the models rely on a feed-forward DNN, the last one on an LSTM. The resulting embeddings enable replacing the usual expert-based target costs by a Euclidean…
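The abstract's central idea, replacing an expert-based target cost with a distance in an embedding space, can be illustrated by a minimal sketch. The function name, the embedding size of 64, and the random vectors below are illustrative assumptions, not the paper's actual models or data.

```python
import numpy as np

def target_cost(candidate_embedding, target_embedding):
    """Euclidean distance between a candidate unit's phone-level embedding
    and the embedding predicted for the target phone specification."""
    return float(np.linalg.norm(np.asarray(candidate_embedding) - np.asarray(target_embedding)))

# Hypothetical usage: choose the corpus unit whose embedding lies closest to the target.
rng = np.random.default_rng(0)
target = rng.random(64)                          # embedding predicted from text; 64 is an illustrative size
candidates = [rng.random(64) for _ in range(5)]  # embeddings of candidate units from the speech corpus
best_index = min(range(len(candidates)), key=lambda i: target_cost(candidates[i], target))
print(best_index, target_cost(candidates[best_index], target))
```

In a unit selection system this target cost would be combined with a concatenation cost during the search; the sketch only shows the distance computation itself.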

Cited by 6 publications (2 citation statements)
References 10 publications
“…More recently, most unit selection systems have shifted to a hybrid architecture that includes a DNN to learn the cost functions [16]. Following this trend, the second system for the experiments, called hybrid TTS, is inspired by [17]. Its target cost is computed based on a Euclidean distance in an embedding space.…”
Section: Corpora (mentioning)
confidence: 99%
“…We define a linguistic feature vector for each phoneme in the utterance, extracted from the text and providing information about the phoneme, e.g., its identity, its preceding and following neighbours, its position in the syllable/word/utterance it belongs to, etc. The linguistic features are automatically extracted [17]. The linguistic vector, of size 296, contains categorical and numerical features.…”
Section: Information Extraction (mentioning)
confidence: 99%
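The 296-dimensional linguistic vector described above (categorical features such as phone identity and neighbours, plus numerical features such as positions) suggests a construction by simple concatenation. The sketch below is an illustrative assumption with a toy phone set and only a few features; it is not the actual feature inventory or encoding of the cited system.

```python
import numpy as np

# Hypothetical, much smaller phone set than the one behind the 296-dim vector in [17].
PHONES = ["a", "e", "i", "o", "u", "p", "t", "k", "s", "sil"]

def one_hot(symbol, inventory):
    """One-hot encode a categorical symbol over a fixed inventory."""
    vec = np.zeros(len(inventory))
    vec[inventory.index(symbol)] = 1.0
    return vec

def linguistic_vector(prev_phone, phone, next_phone,
                      pos_in_syllable, pos_in_word, pos_in_utterance):
    """Concatenate one-hot categorical features (phone identity and neighbours)
    with numerical positional features into a per-phoneme input vector."""
    categorical = np.concatenate([one_hot(prev_phone, PHONES),
                                  one_hot(phone, PHONES),
                                  one_hot(next_phone, PHONES)])
    numerical = np.array([pos_in_syllable, pos_in_word, pos_in_utterance])
    return np.concatenate([categorical, numerical])

# Example: vector for /t/ preceded by /a/ and followed by /i/, with relative positions.
v = linguistic_vector("a", "t", "i", 0.5, 0.3, 0.1)
print(v.shape)  # (33,) in this toy setup, vs. 296 in the cited system
```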