Perception of Synthetic Speech

Pisoni, David B.

doi:10.1007/978-1-4612-1894-4_43

Cited by 30 publications

(5 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…On-line techniques may be more sensitive. Pisoni (1987Pisoni ( , 1997 used reaction time measures to compare and evaluate perfectly intelligible synthetic speech systems. Reaction time measures, such as lexical decision time or phoneme detection time, are assumed to reflect the speed with which different types of speech can be processed.…”

Section: Experiments 1: Linear Vs Nonlinear Time Compressionmentioning

confidence: 99%

Word perception in fast speech: artificially time-compressed vs. naturally produced fast speech

Janse

2004

Speech Communication

View full text Add to dashboard Cite

Natural fast speech differs from normal-rate speech with respect to its temporal pattern. Previous results showed that word intelligibility of heavily artificially time-compressed speech could not be improved by making its temporal pattern more similar to that of natural fast speech. This might have been due to the extrapolation of timing rules for natural fast speech to rates that are much faster than can be attained by human speakers. The present study investigates whether, at a speech rate that human speakers can attain, artificially time-compressed speech is easier to process if its timing pattern is similar to that of naturally produced fast speech. Our first experiment suggests, however, that word processing speed was slowed down, relative to linear compression. In a second experiment, word processing of artificially time-compressed speech was compared with processing of naturally produced fast speech. Even when naturally produced fast speech is perfectly intelligible, its less careful articulation, combined with the changed timing pattern, slows down processing, relative to linearly time-compressed speech. Furthermore, listeners preferred artificially time-compressed speech over naturally produced fast speech. These results suggest that linearly time-compressed speech has both a temporal and a segmental advantage over natural fast speech.

show abstract

Section: Experiments 1: Linear Vs Nonlinear Time Compressionmentioning

confidence: 99%

Word perception in fast speech: artificially time-compressed vs. naturally produced fast speech

Janse

2004

Speech Communication

View full text Add to dashboard Cite

show abstract

“…In prior literature, researchers have investigated the perceptions of synthetic speech in varied behavioral settings [14], but have not investigated how properties of machine-like voice such as pitch contour and flanging vary on a machine-to-human spectrum or influence user trust. Accordingly, in this study we examined two main research questions: 1) Do properties of machine-like speech vary on a machineto-human spectrum?…”

Section: Introductionmentioning

confidence: 99%

The effects of pitch contour and flanging on trust in speaking cognitive agents

Muralidharan

Visser

Parasuraman

2014

CHI '14 Extended Abstracts on Human Factors in Computing Systems

View full text Add to dashboard Cite

Speech from intelligent "cognitive agents" can vary along a machine-to-human spectrum, from very machine-like to very human-like [3,4]. Effective interaction with such agents may depend on whether they are trusted by human users. This study investigated properties of machine-like speech along a machine-to-human spectrum in order to identify those associated with higher trust. We first examined whether flanging (time delay) and pitch contour could be used to map a machine-to-human speech spectrum. We found that lower pitch range and greater time delay generated more machine-like speech. Subsequently we examined perceived trust levels for different sounds along the spectrum. We found that human-speech had higher ratings of trust than machine-like speech. Finally, we used the behavioral TNO Trust Task (T 3 ) to examine trust and compliance levels with cognitive agents speaking in different voices. The results confirmed that participants complied with and trusted agents with human speech more than agents with machine-like speech.

show abstract

“…Although Kanzi’s performance with computer-generated speech was lower than with natural speech (both unmanipulated and degraded), he still chose the correct lexigram at a rate significantly higher than chance for the unmanipulated computer-generated stimuli and for the sinusoidally degraded versions of these stimuli. It is worth noting that the specific synthesis mode (formant synthesis) used for creating the computer-generated stimuli was overtly crude and un-human like (robotic-like) and previous work has shown that humans can struggle with such computer-generated stimuli, primarily when presented with novel phrases (Pisoni 1997 ). That Kanzi could still understand even some degraded versions of formant synthesised computer-generated stimuli demonstrates the existence of perceptual mechanisms in bonobos that are remarkably resilient when presented with highly deviant, non-natural speech.…”

Section: Discussionmentioning

confidence: 99%

Degraded and computer-generated speech processing in a bonobo

et al. 2022

View full text Add to dashboard Cite

The human auditory system is capable of processing human speech even in situations when it has been heavily degraded, such as during noise-vocoding, when frequency domain-based cues to phonetic content are strongly reduced. This has contributed to arguments that speech processing is highly specialized and likely a de novo evolved trait in humans. Previous comparative research has demonstrated that a language competent chimpanzee was also capable of recognizing degraded speech, and therefore that the mechanisms underlying speech processing may not be uniquely human. However, to form a robust reconstruction of the evolutionary origins of speech processing, additional data from other closely related ape species is needed. Specifically, such data can help disentangle whether these capabilities evolved independently in humans and chimpanzees, or if they were inherited from our last common ancestor. Here we provide evidence of processing of highly varied (degraded and computer-generated) speech in a language competent bonobo, Kanzi. We took advantage of Kanzi’s existing proficiency with touchscreens and his ability to report his understanding of human speech through interacting with arbitrary symbols called lexigrams. Specifically, we asked Kanzi to recognise both human (natural) and computer-generated forms of 40 highly familiar words that had been degraded (noise-vocoded and sinusoidal forms) using a match-to-sample paradigm. Results suggest that—apart from noise-vocoded computer-generated speech—Kanzi recognised both natural and computer-generated voices that had been degraded, at rates significantly above chance. Kanzi performed better with all forms of natural voice speech compared to computer-generated speech. This work provides additional support for the hypothesis that the processing apparatus necessary to deal with highly variable speech, including for the first time in nonhuman animals, computer-generated speech, may be at least as old as the last common ancestor we share with bonobos and chimpanzees.

show abstract

Perception of Synthetic Speech

Cited by 30 publications

References 27 publications

Word perception in fast speech: artificially time-compressed vs. naturally produced fast speech

Word perception in fast speech: artificially time-compressed vs. naturally produced fast speech

The effects of pitch contour and flanging on trust in speaking cognitive agents

Degraded and computer-generated speech processing in a bonobo

Contact Info

Product

Resources

About