This paper proposes a new approach to speech synthesis based on waveform segments. One novel point of this approach is its new formant frequency modification algorithm, which makes it possible to change formant frequencies flexibly and so reproduce the desired speech quality. The algorithm characterizes speech formants not only by formant frequencies and formant bandwidths, but also by the spectral intensities at the formant frequencies. The desired formant structure, which is specified by these parameters, is obtained by iteratively modifying the formant bandwidths. Using the specified formant structure, the speech signal is synthesized by FFT. Evaluation by an acoustic distance measure and by listening tests confirms the good performance of the approach.

INTRODUCTION

It has been reported that a text-to-speech system based on time-domain prosodic modification can synthesize high-quality speech [1]. We developed a text-to-speech system using a similar approach, and an earlier report confirmed the good performance of our system [2]. The unique ideas and advantages of our system are as follows:

(1) Phoneme segment units are selected from a large database (45,000 phoneme segments) which consists of not only word utterances but also sentence utterances. This makes it possible to locate more appropriate phoneme segments characterized by phoneme environment, sentence structure, and so on.

(2) Optimum phoneme segments are selected considering various factors, such as phoneme environment, phoneme duration, pitch frequency contour, and speech power. This makes it possible to synthesize more natural speech.

In this paper, we propose a new formant frequency modification algorithm and introduce the algorithm into a waveform-based speech synthesis system. There are two aims. One is to overcome the limitation of phoneme variety caused by the size of the speech database. Through system performance evaluation, we found that our original range of phonemes was insufficient, especially for some rare phoneme environments. In these cases, formant frequency modification is necessary. The other aim is to synthesize speech with various qualities, such as male speech, female speech, childish speech, husky speech, and so on [3][4][5]. Because constructing a large database is expensive and time consuming, controlling speech quality by formant frequency modification is much preferable to constructing a database that covers a wide variety of speakers.

The novel points of the proposed algorithm are (1) introducing spectral intensity as a parameter to specify the formant structure, and (2) iteratively modifying speech to produce the specified formant structure. Section 2 introduces the new formant frequency modification algorithm. In Section 3, to confirm the performance of the proposed algorithm, formant structures are iteratively modified so as to reproduce the desired formant structure. Section 4 applies the proposed algorithm to a waveform-based speech synthesis system.

A NEW FORMANT FREQUENCY MODIFICATION ALGORITHM

Outline of the algorithm

In the AR model of speech,...
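The excerpt breaks off before the algorithm itself is given, so the following is only an illustrative sketch of the general idea stated above: in an all-pole (AR) model, each formant can be described by a frequency and a bandwidth, and the bandwidths can be adjusted iteratively until the spectral intensity at each formant frequency reaches a specified value. The function names, the multiplicative update rule, the step size, and the bandwidth limits are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def ar_coefficients(formants, fs):
    """Build the AR polynomial A(z) from (frequency_hz, bandwidth_hz) pairs.

    Each formant is modeled as one complex-conjugate pole pair (an assumption).
    """
    a = np.array([1.0])
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)        # pole radius from bandwidth
        theta = 2.0 * np.pi * freq / fs     # pole angle from frequency
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a

def intensity_db(a, freq_hz, fs):
    """Log magnitude (dB) of the all-pole spectrum 1/|A(e^{jw})| at freq_hz."""
    w = 2.0 * np.pi * freq_hz / fs
    z = np.exp(-1j * w * np.arange(len(a)))
    return -20.0 * np.log10(np.abs(np.dot(a, z)))

def fit_bandwidths(freqs, target_db, fs, bw0=100.0, n_iter=50, step=0.5):
    """Iteratively adjust bandwidths so peak intensities approach target_db.

    Hypothetical update rule: a peak that is too strong gets a wider
    bandwidth, a peak that is too weak gets a narrower one.
    """
    target_db = np.asarray(target_db, dtype=float)
    bws = np.full(len(freqs), bw0)
    for _ in range(n_iter):
        a = ar_coefficients(list(zip(freqs, bws)), fs)
        current = np.array([intensity_db(a, f, fs) for f in freqs])
        err = current - target_db
        # Multiplicative update in the log domain, clipped to assumed limits.
        bws = np.clip(bws * 10.0 ** (step * err / 20.0), 10.0, 2000.0)
    return bws
```

For a single resonance the peak level falls by roughly 20 dB per decade of bandwidth, which is why the update is applied in the log domain. With several formants the peaks interact slightly, so the loop repeats until the intensities settle.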
This paper describes major research and development in speech recognition and synthesis technologies at NTT from the telecommunications applications viewpoint. Technologies include speaker-dependent and speaker-independent word recognition based on DP matching, speaker-independent word spotting based on HMM, large-vocabulary speaker-independent continuous speech recognition based on HMM-LR, and high-quality Japanese text-to-speech synthesis. A commercial ANSER system that uses speech recognition and synthesis technologies is also introduced.

1. Introduction

The "Multimedia Era" will start based on the advent of B-ISDN and ... Under these circumstances, the various new services will utilize video, speech, text, data and other multimedia ...

... a computer. When ANSER was first developed in 1981, the system had only voice response capability and could accept input only from touch-tone telephones through DTMF signals. Speech recognition was added by the end of that year, permitting system access through ordinary dial telephones. Later, facsimile and modem access capabilities were added. Figure 1 shows a typical ANSER system configuration for a banking application. ANSER systems are in place in more than 15 cities across Japan, with all ANSER centers interconnected by a data communications network. Customers can access an ANSER center and obtain banking services for a small fee wherever they live.

Speaker-independent speech recognition is particularly difficult through telephone lines because, in addition to variation among speakers, telephone sets and lines cause varying amounts of distortion. The system's 16-word lexicon consists of the 10 digits and six control words in Japanese. A huge amount of telephone speech with a wide range of telephone-set and line variations and speaker characteristics was collected to form a speech database. The samples came from three regions of Japan and were generated by 15% male and female speakers ranging in age from 20 to 60 years.

The basic idea for vocabulary-independent word recognition based on DP matching was introduced. Namely, each word is expressed as a sequence of phoneme templates.
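The idea mentioned in this excerpt, that each word is expressed as a sequence of phoneme templates and recognized by DP matching, can be illustrated with a minimal sketch. It assumes each phoneme template is a small array of feature frames, uses Euclidean frame distance, and applies a basic symmetric DP recursion without slope constraints. The function names and the dictionary layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def dp_distance(template, utterance):
    """Cumulative DP-matching (dynamic time warping) distance.

    Both inputs are 2-D arrays of shape (frames, feature_dims).
    """
    t = np.asarray(template, dtype=float)
    u = np.asarray(utterance, dtype=float)
    # Local Euclidean distance between every template/utterance frame pair.
    local = np.linalg.norm(t[:, None, :] - u[None, :, :], axis=2)
    n, m = local.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return acc[n, m]

def word_template(phonemes, phoneme_templates):
    """Concatenate per-phoneme templates into one word-level template."""
    return np.concatenate([phoneme_templates[p] for p in phonemes], axis=0)

def recognize(utterance, lexicon, phoneme_templates):
    """Return the lexicon word whose template has the smallest DP distance."""
    scores = {
        word: dp_distance(word_template(phonemes, phoneme_templates), utterance)
        for word, phonemes in lexicon.items()
    }
    return min(scores, key=scores.get)
```

Because word templates are assembled from phoneme templates at lookup time, a new word can be added to the lexicon by listing its phonemes, without recording that word. That is the sense in which such a scheme is vocabulary independent.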