A spoken language system combines speech recognition, natural language processing, and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language systems range from simple tasks, such as retrieving information from an existing database (traffic reports, airline schedules), to interactive problem-solving tasks involving complex planning and reasoning (travel planning, traffic routing), to support for multilingual interactions. We examine eight key areas in which basic research is needed to produce spoken language systems: 1) robust speech recognition; 2) automatic training and adaptation; 3) spontaneous speech; 4) dialogue models; 5) natural language response generation; 6) speech synthesis and speech generation; 7) multilingual systems; and 8) interactive multimodal systems. In each area, we identify key research challenges, the infrastructure needed to support research, and the expected benefits. We conclude by reviewing the need for multidisciplinary research, for development of shared corpora and related resources, for computational support, and for rapid communication among researchers. The successful development of this technology will increase accessibility of computers to a wide range of users, will facilitate multinational communication and trade, and will create new research specialties and jobs in this rapidly expanding area.
Algorithms based on spectral subtraction are developed for improving the intelligibility of speech that has been degraded by interference from a second talker's voice. A number of new properties of spectral subtraction are shown, including the effects of phase on the intelligibility of the output speech and the choice of magnitude spectral differences for best results. A harmonic extraction algorithm is also developed.
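The core idea of magnitude spectral subtraction can be sketched as follows. This is an illustrative outline only, not the paper's algorithm: the function name, the over-subtraction factor `alpha`, and the spectral `floor` are assumptions introduced here for clarity. Note that the corrupted signal's phase is reused for resynthesis, which is exactly why the paper's finding on phase effects matters.

```python
import numpy as np

def spectral_subtract(frame, interference_mag, alpha=1.0, floor=0.01):
    """Single-frame magnitude spectral subtraction (hypothetical sketch).

    frame            -- time-domain samples of the corrupted signal
    interference_mag -- estimated magnitude spectrum of the interfering talker
    alpha            -- over-subtraction factor (assumed parameter)
    floor            -- fraction of the input magnitude kept as a spectral floor
    """
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    phase = np.angle(spec)  # the corrupted-signal phase is reused as-is
    # Subtract the interference magnitude, clipping at a small floor so the
    # result never goes negative.
    clean_mag = np.maximum(mag - alpha * interference_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

In practice this would be applied frame by frame with overlap-add; the interference estimate might come from the harmonic extraction step the abstract mentions.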
A novel speech analysis method that uses several established psychoacoustic concepts, perceptually based linear predictive (PLP) analysis, models the auditory spectrum by the spectrum of a low-order all-pole model. The auditory spectrum is derived from the speech waveform by critical-band filtering, equal-loudness-curve pre-emphasis, and intensity-loudness root compression. We demonstrate through analysis of both synthetic and natural speech that psychoacoustic concepts of spectral auditory integration in vowel perception, namely the F1, F2' concept of Carlson and Fant and the 3.5 Bark auditory integration concept of Chistovich, are well modeled by the PLP method. A complete speech analysis-synthesis system based on the PLP method is also described in the paper.

INTRODUCTION

Human speech contains a large amount of information. For speech recognition or low-bit-rate speech compression applications, the relevant information is the phonetic information. All the other information in speech, such as information about the speaker's sex, identity, etc., is extraneous. The question of which parameters bear the phonetic information is a fundamental issue in speech research and has been extensively studied. Several studies indicate that a relatively small number of parameters is sufficient for complete phonetic specification of speech sounds. Low dimensionality of parametric speech representation is highly desirable for present machine speech processing techniques. The problem of low-dimensional parametric representation has been most studied for vowels. We discuss below the work that directly applies to our current research. Based on perceptual experiments, Carlson et al. [2] propose a two-peak representation of vowels. The first spectral peak frequency F1 is identical with the first formant frequency of the vowel. The frequency of the second peak, denoted F2', is determined by experimental phonetic match of the two-peak synthetic speech stimuli to the given vowel.
We will refer to the concept of Carlson et al. as the F1, F2' concept. Carlson et al. [3] and later Bladon [1] propose empirical formulae for F2' computation from the values of the four lowest formant frequencies. The work in [3] also deals with estimation of F2' values directly from the speech signal using histograms of the zero-crossing frequencies from closely spaced broad-band filters. Itahashi and Yokoyama [9] propose to extract the F1 and F2' peaks by 6th-order all-pole modeling of the mel-warped high-order (20th) LP system spectrum. Chistovich et al. [4] propose a concept in which the vowel spectrum is represented by major peaks, obtained by a two-stage process: peak extraction and auditory integration. She and her colleagues report that stimuli with peaks closer than 3 Bark can be approximated by one-peak stimuli with the peak position determined by the center of gravity of the two original peaks. When the distance between spectral peaks increases to more than 3.5 Bark, the one-peak representation is not possible. We will refer to the concept of Chist...
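The 3.5 Bark integration rule described above can be illustrated with a short sketch: convert peak frequencies to the Bark scale, and if two peaks lie within the integration limit, replace them with their amplitude-weighted center of gravity. This is a hedged illustration, not code from any of the cited works; the Traunmüller-style Bark approximation used here is one common formula among several, and the function names are invented for this example.

```python
import numpy as np

def hz_to_bark(f):
    # One common closed-form approximation of the Bark scale (Traunmüller style)
    return 26.81 * f / (1960.0 + f) - 0.53

def merged_peak(f1_hz, f2_hz, a1=1.0, a2=1.0, limit=3.5):
    """Center-of-gravity merging of two spectral peaks, in the spirit of
    Chistovich's 3.5 Bark integration concept (illustrative sketch only).

    Returns the merged peak position in Bark, or None when the peaks are
    farther apart than `limit` Bark and no one-peak equivalent exists.
    """
    z1, z2 = hz_to_bark(f1_hz), hz_to_bark(f2_hz)
    if abs(z2 - z1) > limit:
        return None  # peaks too far apart: one-peak representation fails
    # Amplitude-weighted center of gravity on the Bark scale
    return (a1 * z1 + a2 * z2) / (a1 + a2)
```

For example, peaks at 500 Hz and 700 Hz are under 2 Bark apart and merge into a single equivalent peak, whereas peaks at 500 Hz and 3000 Hz are roughly 11 Bark apart and cannot be represented by one peak.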