This paper reports the results of an investigation that employed the modified rhyme test (MRT) to measure the segmental intelligibility of synthetic speech generated automatically by rule. Synthetic speech produced by ten text-to-speech systems was studied and compared to natural speech. A variation of the standard MRT was also used to study the effects of response set size on perceptual confusions. Results indicated that the segmental intelligibility scores formed a continuum. Several systems displayed very high levels of performance that were close to or equal to scores obtained with natural speech; other systems displayed substantially worse performance compared to natural speech. The overall performance of the best system, DECtalk-Paul, was equivalent to the data obtained with natural speech for consonants in syllable-initial position. The findings from this study are discussed in terms of the use of a set of standardized procedures for measuring intelligibility of synthetic speech under controlled laboratory conditions. Recent work investigating the perception of synthetic speech under more severe conditions in which greater demands are made on the listener's processing resources is also considered. The wide range of intelligibility scores obtained in the present study demonstrates important differences in perception and suggests that not all synthetic speech is perceptually equivalent to the listener.
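The MRT is a closed-set identification task: on each trial the listener hears one word and picks it from a small set of rhyming alternatives. As an illustrative sketch (not drawn from the paper itself), raw percent correct can be computed directly, and a standard guessing correction can be applied to compare scores across different response set sizes; the word lists and responses below are hypothetical examples.

```python
# Illustrative sketch of scoring a closed-set Modified Rhyme Test run.
# The stimulus/response words here are hypothetical examples, not the
# actual MRT word lists used in the study.

def mrt_percent_correct(stimuli, responses):
    """Raw percent correct over all trials."""
    assert len(stimuli) == len(responses)
    hits = sum(s == r for s, r in zip(stimuli, responses))
    return 100.0 * hits / len(stimuli)

def chance_corrected(p_correct, set_size):
    """Correct a proportion correct for guessing in an
    n-alternative forced-choice task: (p - 1/n) / (1 - 1/n)."""
    guess = 1.0 / set_size
    return (p_correct - guess) / (1.0 - guess)

stimuli = ["bat", "pat", "mat", "hat", "rat", "sat"]
responses = ["bat", "pat", "mat", "cat", "rat", "fat"]

pc = mrt_percent_correct(stimuli, responses)  # 4 of 6 correct
print(f"raw percent correct: {pc:.1f}")
print(f"chance-corrected (6 alternatives): {chance_corrected(pc / 100, 6):.2f}")
```

The chance correction matters when comparing the standard six-alternative MRT against variants with larger or smaller response sets, since raw percent correct is inflated more by guessing when fewer alternatives are offered.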
Previous comprehension studies using postperceptual memory tests have often reported negligible differences in performance between natural speech and several kinds of synthetic speech produced by rule, despite large differences in segmental intelligibility. The present experiments investigated the comprehension of natural and synthetic speech using two different on-line tasks: word monitoring and sentence-by-sentence listening. On-line task performance was slower and less accurate for passages of synthetic speech than for passages of natural speech. Recognition memory performance in both experiments was less accurate following passages of synthetic speech than of natural speech. Monitoring performance, sentence listening times, and recognition memory accuracy all showed moderate correlations with intelligibility scores obtained using the Modified Rhyme Test. The results suggest that poorer comprehension of passages of synthetic speech is attributable in part to the greater encoding demands of synthetic speech. In contrast to earlier studies, the present results demonstrate that on-line tasks can be used to measure differences in comprehension performance between natural and synthetic speech.
We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered.

There has always been a practical need for devices that can produce and understand spoken language automatically, without human intervention. At the present time, the development and use of such automated voice-response systems is no longer a matter of basic research in linguistics, engineering, or product development: the technology is now available in the form of specialized microprocessor-based speech-processing devices that can be easily integrated into numerous computer-based systems to support user-machine communication via spoken language.

Speech is, without question, the most natural means of communication (Lindgren, 1967). It is automatic, requires little conscious effort or attention, and creates few, if any, demands while other tasks are carried out concurrently, especially tasks that require active use of the hands or eyes in demanding conditions. One potential use of speech is as an interface to computers. At the present time, most users interact with computers using traditional screens and keyboards. However, these systems can and will eventually be replaced by speech input/output (I/O). Speech is not only more natural for humans to use, but is also faster and less prone to errors.
Although speech interfaces to computers are not yet widely available, extensive research efforts have been carried out over the last few years to develop speech recognition and synthesis technology. In this paper, we examine the use of one aspect of this technology: speech synthesis by rule, using automatic text-to-speech conversion. With a text-to-speech system, any computer can generate spoken output from a string of characters, and therefore can provide the user with a novel speech display instead of the more traditional screen. In some applications, this display may significantly reduce the user's workload and increase operator efficiency in getting information from a computer. In other applications, it may provide entirely new methods for retrieving data and other kinds of information from the computer using standard telephone voice and data channels. At the present time, speech output from computers using some form of text-to-speech conversion is still in its infancy. However, as the technology becomes more widely known and the costs decrease, much wider usage can be anticipated.
Nonnative speakers of English listened to natural and synthetic speech materials. All natural speech material was spoken by a native male speaker of American English. The synthetic speech was produced by the MITalk-79 system for the first experiment and by the Prose 2000 V2.1 text-to-speech system for the second experiment. Results from Experiment 1 indicated that nonnative speakers show higher levels of performance when listening to natural speech than when listening to synthetic speech. However, nonnative speakers did not reach the level of performance of native speakers for either natural or synthetic speech. Experiment 2 provided further evidence that nonnative speakers fail to reach the same level of performance when listening to synthetic speech as native speakers. Performance of nonnative speakers on a dictation task showed high positive correlations with their general English language ability as measured by two standardized tests. Results indicate the importance of language background and experience in the perception of speech, particularly synthetic and digitally encoded speech.