Nowadays, especially with the rise of neural networks, speech synthesis is almost entirely data driven. The goal of this thesis is to provide methods for automatic, unsupervised learning from data for expressive speech synthesis. Compared with "ordinary" synthesis systems, reliable expressive training data is harder to find, despite the huge amount of material available on sources such as the Internet. The main difficulty lies in the highly speaker- and situation-dependent nature of expressiveness, which causes numerous and acoustically substantial variations. As a consequence, it is, first, very difficult to define labels that reliably identify expressive speech in all its nuances. The typical definition of six basic emotions, or similar schemes, is a simplification with serious consequences when dealing with data outside the lab. Second, even if a label set is defined, it is difficult, apart from the enormous manual effort, to gather sufficient training data for models that respect all the nuances and variations.
In this paper we present a DNN-based speech synthesis system trained on an audiobook, including sentiment features predicted by the Stanford sentiment parser. The baseline system uses a DNN to predict acoustic parameters from conventional linguistic features, as used in statistical parametric speech synthesis. The predicted parameters are transformed into speech using a conventional high-quality vocoder. In this work, the conventional linguistic features are enriched with sentiment features. Different sentiment representations are considered, combining sentiment probabilities with hierarchical distance and context. After a preliminary analysis, a listening experiment is conducted in which participants evaluate the different systems. The results show the usefulness of the proposed features and reveal differences between expert and non-expert TTS users.
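To illustrate the kind of feature enrichment described above, the following sketch concatenates a sentence-level sentiment vector with frame-level linguistic features before a small feed-forward acoustic model. This is a minimal PyTorch sketch, not the authors' system: the feature dimensions, network layout, and the per-frame broadcasting of the sentiment vector are all assumptions.

```python
# Minimal sketch (assumed dimensions and architecture, not the paper's system):
# a DNN acoustic model whose linguistic input is enriched with sentiment features.
import torch
import torch.nn as nn

LING_DIM = 300     # conventional linguistic features per frame (assumed size)
SENT_DIM = 5       # e.g. 5-class sentiment probabilities from a sentiment parser
ACOUSTIC_DIM = 60  # vocoder parameters per frame (assumed size)

class AcousticDNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LING_DIM + SENT_DIM, 512), nn.Tanh(),
            nn.Linear(512, 512), nn.Tanh(),
            nn.Linear(512, ACOUSTIC_DIM),
        )

    def forward(self, linguistic, sentiment):
        # Concatenate frame-level linguistic features with the sentiment
        # vector, here assumed to be broadcast to every frame of the sentence.
        x = torch.cat([linguistic, sentiment], dim=-1)
        return self.net(x)

model = AcousticDNN()
frames = torch.randn(100, LING_DIM)      # dummy linguistic features, 100 frames
sent = torch.full((100, SENT_DIM), 0.2)  # uniform dummy sentiment probabilities
acoustic = model(frames, sent)           # shape: (100, ACOUSTIC_DIM)
```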
The goal of the present article is to introduce a new concept of a perception-production timing model for human-machine communication. The model implements a low-level cognitive timing and coordination mechanism. Its basic element is a dynamic oscillator capable of tracking recurring events in time. The organization of these oscillators in a network is referred to as the Dynamic Perception-Production Oscillation Model (DPPOM). The DPPOM is largely based on findings from psychological and phonetic experiments on timing in speech perception and production. It consists of two sub-systems: a perception sub-system, which accounts for information clustering in an input sequence of events, and a production sub-system, which accounts for speech production rhythmically entrained to that input sequence. We propose a system architecture integrating both sub-systems, providing a flexible mechanism for perception-production timing in dialogues. The model's functionality was evaluated in two experiments.
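To make the oscillator idea concrete, the following sketch shows a single adaptive oscillator whose period relaxes toward the recurring inter-onset intervals of an event sequence. It is an illustrative assumption, not the DPPOM implementation: the update rule, the gain value, and the function name are hypothetical.

```python
# Minimal sketch (not the DPPOM itself): one adaptive oscillator entraining
# its period to the inter-onset intervals of a sequence of event times.

def entrain_period(onsets, period=0.5, gain=0.4):
    """Return the oscillator period after adapting to event onsets (seconds)."""
    for prev, curr in zip(onsets, onsets[1:]):
        ioi = curr - prev                # observed inter-onset interval
        period += gain * (ioi - period)  # relax the period toward the IOI
    return period

# Events arriving roughly every 0.4 s pull the period from 0.5 s toward 0.4 s.
print(entrain_period([0.0, 0.41, 0.79, 1.21, 1.60]))  # ~0.41
```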