2021
DOI: 10.1121/10.0004989

Partial compensation for coarticulatory vowel nasalization across concatenative and neural text-to-speech

Abstract: This study investigates the perception of coarticulatory vowel nasality generated using different text-to-speech (TTS) methods in American English. Experiment 1 compared concatenative and neural TTS using a 4IAX task, where listeners discriminated between a word pair containing either both oral or nasalized vowels and a word pair containing one oral and one nasalized vowel. Vowels occurred either in identical or alternating consonant contexts across pairs to reveal perceptual sensitivity and compensatory behav…

Cited by 11 publications (5 citation statements)
References 29 publications

“…Given that devices are perceived as less communicatively competent than humans (Cowan et al., 2015; Cohn et al., 2022), looking at a device may trigger this stereotype and lower comprehension. More broadly, this finding builds on work showing that socio-indexical information and speech perception are intertwined (e.g., D'Onofrio, 2015) and contributes to research indicating that people have distinct mental representations for humans and devices, which affect speech perception (e.g., Zellou et al., 2021).…”
Section: Discussion (supporting)
confidence: 60%
“…Second, an echo was added (delay: 0.01 s; 0.5 Pa). Listeners associate flattened pitch and echo with 'robot' voices (Wilson & Moore 2017), and prior work has shown that this procedure for resynthesis yields speech that is rated as significantly more robotic-sounding than unmodified neural TTS (Zellou, Cohn, & Block 2021).…”
Section: Methods (mentioning)
confidence: 99%
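
The echo manipulation quoted above is concrete enough to sketch. Below is a minimal Python illustration of that kind of resynthesis step, not the cited authors' actual pipeline: the 0.01 s delay comes from the quote, while the file names and the mapping of the 0.5 Pa echo level onto a relative digital gain are assumptions for illustration. The pitch-flattening step mentioned alongside it would normally be done separately (e.g., via PSOLA resynthesis) and is not shown.

```python
# Minimal sketch of an echo manipulation (assumed implementation, not the
# cited authors' pipeline): mix a delayed, attenuated copy of the signal
# back into itself.
import numpy as np
from scipy.io import wavfile

DELAY_S = 0.01    # echo delay taken from the quoted methods: 0.01 s
ECHO_GAIN = 0.5   # hypothetical digital stand-in for the 0.5 Pa echo level

# Hypothetical input file; assumes a mono 16-bit WAV.
rate, signal = wavfile.read("tts_utterance.wav")
signal = signal.astype(np.float64)

# Add the delayed copy starting delay_samples into the signal.
delay_samples = int(round(DELAY_S * rate))
echoed = signal.copy()
echoed[delay_samples:] += ECHO_GAIN * signal[:-delay_samples]

# Normalize to avoid clipping, then write 16-bit output.
echoed *= 32767.0 / np.max(np.abs(echoed))
wavfile.write("tts_utterance_echo.wav", rate, echoed.astype(np.int16))
```
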
“…The latter manipulation follows an approach taken in work exploring the role of 'voice anthropomorphism' in speech perception (e.g. Cowan et al. 2015, Zellou, Cohn, & Block 2021).…”
Section: 2 (mentioning)
confidence: 99%
“…Voice-AI assistants are an apt addressee for investigating the rational listener hypothesis as they are rated as 'less communicatively competent' than adult human interlocutors (Cohn et al., 2022), display many errors in recognition (e.g., 20-30% word error rate in Koenecke et al., 2020) and demonstrate difficulties extracting meaning ('natural language understanding') (Beneteau et al., 2019). Additionally, the text-to-speech (TTS) output they produce is often perceived as 'choppy' (Doyle et al., 2019; Zellou, Cohn, & Block, 2021). Indeed, recent work has shown that people modify their speech in distinct ways for voice-AI addressees (for a review, see Cohn et al., 2022), with the most prominent acoustic differences being in the prosodic domain (Cohn et al., 2022; Raveh et al., 2019; Siegert et al., 2019).…”
Section: Rational Listener Hypothesis (mentioning)
confidence: 99%