2021
DOI: 10.1121/10.0004989

Partial compensation for coarticulatory vowel nasalization across concatenative and neural text-to-speech

Abstract: This study investigates the perception of coarticulatory vowel nasality generated using different text-to-speech (TTS) methods in American English. Experiment 1 compared concatenative and neural TTS using a 4IAX task, where listeners discriminated between a word pair containing either both oral or nasalized vowels and a word pair containing one oral and one nasalized vowel. Vowels occurred either in identical or alternating consonant contexts across pairs to reveal perceptual sensitivity and compensatory behav…

Cited by 11 publications (5 citation statements)
References 29 publications

“…Given that devices are perceived as less communicatively competent than humans (Cowan et al., 2015; Cohn et al., 2022), looking at a device may trigger this stereotype and lower comprehension. More broadly, this finding builds on work showing that socio-indexical information and speech perception are intertwined (e.g., D'Onofrio, 2015) and contributes to research indicating that people have distinct mental representations for humans and devices, which affect speech perception (e.g., Zellou et al., 2021).…”
Section: Discussion (supporting)
confidence: 60%
“…Second, an echo was added (delay: 0.01 s; 0.5 Pa). Listeners associate flattened pitch and echo with 'robot' voices (Wilson & Moore 2017), and prior work has shown that this procedure for resynthesis yields speech that is rated as significantly more robotic-sounding than unmodified neural TTS (Zellou, Cohn, & Block 2021).…”
Section: Methods (mentioning)
confidence: 99%
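
The echo manipulation quoted above is concrete enough to sketch. Below is a minimal Python illustration of that kind of resynthesis step, not the cited authors' actual pipeline: the 0.01 s delay comes from the quote, while the file names and the mapping of the 0.5 Pa echo level onto a relative digital gain are assumptions for illustration. The pitch-flattening step mentioned alongside it would normally be done separately (e.g., via PSOLA resynthesis) and is not shown.

```python
# Minimal sketch of an echo manipulation (assumed implementation, not the
# cited authors' pipeline): mix a delayed, attenuated copy of the signal
# back into itself.
import numpy as np
from scipy.io import wavfile

DELAY_S = 0.01    # echo delay taken from the quoted methods: 0.01 s
ECHO_GAIN = 0.5   # hypothetical digital stand-in for the 0.5 Pa echo level

# Hypothetical input file; assumes a mono 16-bit WAV.
rate, signal = wavfile.read("tts_utterance.wav")
signal = signal.astype(np.float64)

# Add the delayed copy starting delay_samples into the signal.
delay_samples = int(round(DELAY_S * rate))
echoed = signal.copy()
echoed[delay_samples:] += ECHO_GAIN * signal[:-delay_samples]

# Normalize to avoid clipping, then write 16-bit output.
echoed *= 32767.0 / np.max(np.abs(echoed))
wavfile.write("tts_utterance_echo.wav", rate, echoed.astype(np.int16))
```
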
“…The latter manipulation follows an approach taken in work exploring the role of 'voice anthropomorphism' in speech perception (e.g. Cowan et al. 2015, Zellou, Cohn, & Block 2021).…”
Section: 2 (mentioning)
confidence: 99%
“…Voice-AI assistants are an apt addressee for investigating the rational listener hypothesis as they are rated as 'less communicatively competent' than adult human interlocutors (Cohn et al., 2022), display many errors in recognition (e.g., 20-30% word error rate in Koenecke et al., 2020) and demonstrate difficulties extracting meaning ('natural language understanding') (Beneteau et al., 2019). Additionally, the text-to-speech (TTS) output they produce is often perceived as 'choppy' (Doyle et al., 2019; Zellou, Cohn, & Block, 2021). Indeed, recent work has shown that people modify their speech in distinct ways for voice-AI addressees (for a review, see Cohn et al., 2022), with the most prominent acoustic differences being in the prosodic domain (Cohn et al., 2022; Raveh et al., 2019; Siegert et al., 2019).…”
Section: Rational Listener Hypothesis (mentioning)
confidence: 99%