The absence of alternatives/variants is a dramatic limitation of text-to-speech synthesis compared to the variety of human speech. This paper introduces the use of speech alternatives/variants to improve text-to-speech synthesis systems. Speech alternatives denote the variety of ways in which a speaker can pronounce a sentence, depending on linguistic constraints, specific strategies of the speaker, speaking style, and pragmatic constraints. During training, the symbolic and acoustic characteristics of a unit-selection speech synthesis system are statistically modelled with context-dependent parametric models (GMMs/HMMs). During synthesis, symbolic and acoustic alternatives are exploited using a Generalized Viterbi Algorithm (GVA) to determine the sequence of speech units used for the synthesis. Objective and subjective evaluations provide evidence that the use of speech alternatives significantly improves speech synthesis over conventional speech synthesis systems. Moreover, speech alternatives can also be used to vary the speech synthesis for a given text. The proposed method can easily be extended to HMM-based speech synthesis.
Nicolas Obin, Christophe Veaux, Pierre Lanchantin

Introduction

Today, speech synthesis systems (unit-selection [1], HMM-based [2]) are able to produce natural synthetic speech from text. Over the last decade, research has mainly focused on the modelling of speech prosody, "the music of speech" (accent/phrasing, intonation/rhythm), for text-to-speech synthesis. Among the available approaches, GMMs/HMMs (Gaussian Mixture Models and Hidden Markov Models) are today the most popular methods used to model speech prosody. In particular, the modelling of speech prosody has gradually and durably moved from short-time representations ("frame-by-frame": [3,4,5,6,7]) to larger-time representations [8,9,10,11]. Recent research also tends to introduce deep architectures (Deep Neural Networks [12]) to model the complexity of speech more efficiently.

However, current speech synthesis systems still suffer from a number of limitations, with the consequence that synthetic speech does not sound entirely "human". In particular, the absence of alternatives/variants in the synthetic speech is a dramatic limitation compared to the variety of human speech (see figure 14.1 for illustration): for a given text, the speech synthesis system will always produce exactly the same synthetic speech, whereas a human speaker can use a variety of alternatives/variants to pronounce it. This variety may induce variations in the symbolic (prosodic events: accent, phrasing) and acoustic (prosody: prosodic contour; segmental: articulation, co-articulation) speech characteristics. These alternatives depend on linguistic constraints, specific strategies of the speaker, speaking style, and pragmatic constraints. Current speech synthesis systems do not exploit this variety during statistical modelling or synthesis. During the training, the symbolic and acoustic speech characteristics are ...