The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach

Tits, Noé; Haddad, Kevin El; Dutoit, Thierry

doi:10.5772/intechopen.89849

Cited by 4 publications

(3 citation statements)

References 20 publications

“…The voice quality and the number of control parameters depend on the synthesis technique used [1,5]. These parameters allow variations to be created in the voice.…”

Section: Related Work and Challenges (mentioning)

confidence: 99%

Analysis and Assessment of Controllability of an Expressive Deep Learning-Based TTS System

Tits, Haddad, Dutoit

2021

Informatics

Self Cite

In this paper, we study the controllability of an Expressive TTS system trained on a dataset for continuous control. The dataset is the Blizzard 2013 dataset, based on audiobooks read by a female speaker and containing great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which users are shown an interface for Controllable Expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that of a reference utterance.

“…Speech synthesis methods can be grouped in three main categories: synthesis by concatenation, parametric synthesis and statistical parametric synthesis [4]. Among the few studies on laughter synthesis, the first attempts included techniques like synthesis by diphone concatenation [5], parametric synthesis and by using a mass-spring approach [6].…”

Section: Related Work (mentioning)

confidence: 99%

Laughter Synthesis: Combining Seq2seq Modeling with Transfer Learning

Tits, Haddad, Dutoit

2020

Interspeech 2020

Self Cite

Despite the growing interest in expressive speech synthesis, synthesis of nonverbal expressions is an under-explored area. In this paper we propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system. We leverage transfer learning by training a deep learning model to learn to generate both speech and laughs from annotations. We evaluate our model with a listening test, comparing its performance to that of an HMM-based laughter synthesis system, and find that it reaches higher perceived naturalness. Our solution is a first step towards a TTS system that would be able to synthesize speech with control over amusement level through laughter integration.

“…Generating natural speech is a fundamental building block in improving human-computer interaction (Tits et al, 2019). Modeling and converting emotion in speech is arguably one of the main challenges in developing more natural and expressive speech synthesis models.…”

Section: Introduction (mentioning)

confidence: 99%

Textless Speech Emotion Conversion using Discrete and Decomposed Representations

Kreuk, Polyak, Copet, et al.

2021

Preprint

Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples and code will be publicly available under the following link: https://speechbot.github.io/emotion.
