Direct Expressive Voice Training Based on Semantic Selection

Jauk, Igor; Bonafonte, Antonio

doi:10.21437/interspeech.2016-979

Cited by 1 publication

(2 citation statements)

References 10 publications

“…Semantic vector representations of text have been used to perform a look-up in the training corpus for expressive speech data according to the textual input, such that, relying on semantic information, data clusters were used to train expressive voices via speaker adaptation, as for example in [1]. A logical evolution of this study is to use embeddings which are more dedicated to the expressiveness in text.…”

Section: Introduction (mentioning)

confidence: 99%

Expressive Speech Synthesis Using Sentiment Embeddings

Jauk, Lorenzo-Trueba, Yamagishi et al. 2018

Interspeech 2018

Self Cite

In this paper we present a DNN-based speech synthesis system trained on an audiobook including sentiment features predicted by the Stanford sentiment parser. The baseline system uses a DNN to predict acoustic parameters from conventional linguistic features, as used in statistical parametric speech synthesis. The predicted parameters are transformed into speech using a conventional high-quality vocoder. In this paper, the conventional linguistic features are enriched with sentiment features. Different sentiment representations are considered, combining sentiment probabilities with hierarchical distance and context. After a preliminary analysis, a listening experiment is conducted in which participants evaluate the different systems. The results show the usefulness of the proposed features and reveal differences between expert and non-expert TTS users.

“…A further improvement in comparison to work presented in [1] is the migration from HMM-based synthesis to DNN-based synthesis. A main drawback of the HMM-based synthesis is that the training data is clustered.…”

Section: Introduction (mentioning)

confidence: 99%