We present an adaptive control scheme in a neural-based model to improve the performance of articulatory speech synthesis. The model generates the articulatory trajectories for English plosive-vowel patterns through regional target approximation in the articulatory domain and categorical perception in the auditory domain. The proposed method can effectively model the variations of static vowel sounds and adapt to changes and uncertainties in the dynamic plosive sounds. Simulation studies demonstrate that the proposed scheme is able to estimate and regulate the control parameters of the articulatory synthesizer to produce smooth and authentic acoustic-phonetic output.
I. INTRODUCTION

Articulatory speech synthesis holds many benefits in text-to-speech (TTS) applications, e.g., automated facial animation and treatment of speech disorders [1]. Though concatenative synthesis is currently the leading method in TTS systems, it is often constrained to the available set of phonetic patterns, speakers, and speaking styles. In contrast, an articulatory synthesizer explicitly defines the phonetic properties, the physiological characteristics of the speaker (e.g., gender, age, and emotional state), and the speaking style. However, it is difficult to design an articulatory synthesizer that approximates the anatomic structure of the human vocal system, generates authentic acoustic sounds, and offers automated control of the articulators. Few systems integrate all of these aspects [2], [3]. Existing articulatory synthesizers usually separate the subglottal, vocal cord, and vocal tract systems in the resulting structure, e.g., ArtiSynth, VocalTractLab, etc. [4], [5]. Some of them have to use an additional acoustic module for sound generation to reduce the computational cost and to produce comparable sound output for TTS applications [6], [7]. Even though many synthesizers are able to produce sonorants such as vowels rather efficiently, consonantal phenomena such as plosives, which are caused by muscular constrictions, are usually poorly represented. The problem also exists at phonetic boundaries, where a smooth transition between neighboring phones requires effective control of the moving articulators.

Previously, Öhman used a coarticulation model to simplify the consonantal effects as local perturbations during vowel-consonant-vowel (VCV) synthesis. On the other hand, Birkholz, Kröger & Neuschaefer-Rube proposed to model articulatory dynamics with a tenth-order dynamical system to reproduce natural-sounding and smooth consonant-vowel (CV) sequences, with control of the articulators realized via a prior arrangement of the gestural scores: target position and time constant [8], [9] (a minimal sketch of such a target-approximation system is given below). A variety of control methods have also been proposed to realize acoustic-articulatory inversion using neural networks (NNs) [10], [11]. However, when using fixed-structure NNs to model the nonlinear relations between the articulatory gestures and the acoustic cues, there are two major difficulties. On the ...
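For concreteness, the target-approximation dynamics referenced above can be illustrated as follows. For an Nth-order critically damped linear system $(\tau \frac{d}{dt} + 1)^N (x(t) - b) = 0$, the articulatory parameter $x(t)$ relaxes toward the gestural target $b$ with time constant $\tau$, and the trajectory has the closed form $x(t) = b + \left(\sum_{i=0}^{N-1} c_i t^i\right) e^{-t/\tau}$, where the coefficients $c_i$ are fixed by the state at gesture onset. The Python sketch below computes such a trajectory; the function name, the choice N = 10, and the parameter values in the usage example are illustrative assumptions, not taken from the cited implementations.

import numpy as np
from math import comb, factorial

def target_approximation(b, tau, d, t):
    """Closed-form trajectory of the critically damped system
    (tau*d/dt + 1)^N (x - b) = 0, with N = len(d).
    b   : gestural target position (illustrative units)
    tau : time constant in seconds
    d   : d[k] = k-th derivative of x(t) - b at gesture onset t = 0
    t   : array of time points
    """
    N = len(d)
    c = np.zeros(N)
    # Solve the lower-triangular system that maps the onset derivatives
    # d[k] to the polynomial coefficients c[k] (Leibniz rule applied to
    # p(t) * exp(-t/tau) evaluated at t = 0).
    for k in range(N):
        s = sum(comb(k, j) * factorial(j) * c[j] * (-1.0 / tau) ** (k - j)
                for j in range(k))
        c[k] = (d[k] - s) / factorial(k)
    poly = sum(c[i] * t ** i for i in range(N))
    return b + poly * np.exp(-t / tau)

# Usage: a tenth-order gesture starting at rest at x = 0 and moving toward
# the target b = 1 with a hypothetical 15 ms time constant.
t = np.arange(0.0, 0.3, 0.001)
x = target_approximation(b=1.0, tau=0.015, d=[-1.0] + [0.0] * 9, t=t)

Because the polynomial coefficients are determined entirely by the onset state, successive gestures concatenated with matched derivatives at their boundaries produce the smooth CV transitions that the dynamical-system formulation is designed to yield.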