Talkers vary in the phonetic realization of their vowels. One influential hypothesis holds that listeners overcome this inter-talker variability through pre-linguistic auditory mechanisms that normalize the acoustic or phonetic cues that form the input to speech recognition. Dozens of competing normalization accounts exist, including both vowel-specific accounts (e.g., Lobanov, 1971; Nearey, 1978; Syrdal and Gopal, 1986) and general-purpose accounts applicable to any type of phonetic cue (McMurray and Jongman, 2011). We add to the cross-linguistic literature by comparing normalization accounts against a new database of Swedish, a language with a particularly dense vowel inventory of 21 vowels differing in quality and quantity. We train Bayesian ideal observers (IOs) on unnormalized or normalized vowel data under different assumptions about the relevant cues to vowel identity (F0-F3 and vowel duration), and evaluate their performance in predicting the category intended by the talker. The results indicate that the best-performing normalization accounts centered and/or scaled formants by talker (e.g., Lobanov), replicating previous findings for other languages with less dense vowel spaces. The relative advantage of Lobanov normalization decreased when additional cues were included, indicating that simple centering relative to the talker's mean (e.g., C-CuRE) might be sufficient to achieve robust inter-talker perception.
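To make the normalization step concrete, the sketch below illustrates Lobanov-style normalization, which z-scores each formant within each talker. The function name, array layout, and use of NumPy are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lobanov_normalize(formants, talker_ids):
    """Z-score each formant column within each talker (Lobanov, 1971).

    formants:   (n_tokens, n_formants) array, e.g., columns F1-F3 in Hz.
    talker_ids: length-n_tokens array of talker labels.
    """
    formants = np.asarray(formants, dtype=float)
    talker_ids = np.asarray(talker_ids)
    normalized = np.empty_like(formants)
    for talker in np.unique(talker_ids):
        rows = talker_ids == talker
        mu = formants[rows].mean(axis=0)   # talker-specific mean per formant
        sd = formants[rows].std(axis=0)    # talker-specific SD per formant
        normalized[rows] = (formants[rows] - mu) / sd
    return normalized
```

After this transformation, formant values express how far a token lies from that talker's average, in units of that talker's variability, which removes much of the between-talker offset and scale difference.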
Talkers vary in their vowel pronunciation. One hypothesis holds that listeners achieve robust speech perception through pre-linguistic normalization. In recent work (also submitted to ASA), we modeled listeners' perception of naturally produced /h-VOWEL-d/ words. The best-performing normalization models accounted for ∼90% of the explainable variance in listeners' responses. Here, we investigate whether the remaining 10% (1) follow from other mechanisms or (2) reflect listeners' ability to use more cues than are available to the models. We constructed a new set of *synthesized* /h-VOWEL-d/ stimuli that varied only in F1 and F2. Unsurprisingly, listeners (N = 24) performed worse on these synthesized stimuli than on the natural stimuli (performance estimated as inter-listener agreement in categorization). Critically, though, we find (1) that the same normalization accounts that best explained listeners' responses to natural stimuli also best explain responses to synthesized stimuli, and (2) that the best-performing model again accounted for ∼90% of the explainable variance. This suggests that the 'failure' of normalization accounts to fully explain listeners' categorization behavior is *not* due to restrictions in our ability to feed the models all available cues. Rather, normalization alone, while critical to perception, seems insufficient to fully explain listeners' ability to adapt based on recent input.
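As a rough illustration of how model fit can be expressed relative to a listener-based ceiling, the sketch below treats inter-listener agreement as the best achievable match and reports the model's above-chance match as a proportion of that ceiling. The function, its arguments, and the example numbers are hypothetical; the abstract's exact measure of explainable variance may be computed differently.

```python
def proportion_of_ceiling(model_match, listener_agreement, chance):
    """Express model fit as a share of the explainable (above-chance) range.

    model_match:        proportion of trials on which the model's predicted
                        category matches listeners' responses.
    listener_agreement: inter-listener agreement, taken here as the ceiling
                        on how well any model could match responses.
    chance:             chance-level agreement (e.g., 1/8 for 8 categories).
    """
    return (model_match - chance) / (listener_agreement - chance)

# Hypothetical numbers, for illustration only:
print(proportion_of_ceiling(model_match=0.62, listener_agreement=0.68, chance=0.125))
```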
One of the central computational challenges for speech perception is that talkers differ in pronunciation, i.e., in how they map linguistic categories and meanings onto the acoustic signal. Yet listeners typically overcome these difficulties within minutes (Clarke & Garrett, 2004; Xie et al., 2018). The mechanisms that underlie these adaptive abilities remain unclear. One influential hypothesis holds that listeners achieve robust speech perception across talkers through low-level, pre-linguistic normalization. We investigate the role of normalization in the perception of L1-US English vowels. We train ideal observers (IOs) on unnormalized or normalized acoustic cues using a phonetic database of 8 /h-VOWEL-d/ words of US English (N = 1240 recordings from 16 talkers; Xie & Jaeger, 2020). All IOs had zero degrees of freedom in predicting perception, i.e., their predictions are completely determined by the pronunciation statistics. We compare the IOs' predictions against L1-US English listeners' 8-way categorization responses for /h-VOWEL-d/ words in a web-based experiment. We find that (1) pre-linguistic normalization substantially improves the fit to human responses from 74% to 90% of best-possible performance (chance = 12.5%); (2) the best-performing normalization accounts centered and/or scaled formants by talker; and (3) general-purpose normalization (C-CuRE; McMurray & Jongman, 2011) performed as well as vowel-specific normalization.
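A minimal sketch of a zero-degrees-of-freedom ideal observer of the kind described above: one multivariate Gaussian per vowel category is estimated from the production data (unnormalized or normalized cues), and categorization follows from Bayes' rule with a uniform prior. Function names and data layout are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_ideal_observer(cues, labels):
    """Fit one multivariate Gaussian per vowel category from production data.

    cues:   (n_tokens, n_cues) array of (possibly normalized) cue values.
    labels: length-n_tokens array of intended vowel categories.
    Returns a dict mapping category -> (mean vector, covariance matrix).
    """
    cues, labels = np.asarray(cues, dtype=float), np.asarray(labels)
    return {v: (cues[labels == v].mean(axis=0),
                np.cov(cues[labels == v], rowvar=False))
            for v in np.unique(labels)}

def categorize(io, x):
    """Posterior over categories for cue vector x, assuming a uniform prior."""
    likelihoods = {v: multivariate_normal.pdf(x, mean=mu, cov=cov)
                   for v, (mu, cov) in io.items()}
    total = sum(likelihoods.values())
    return {v: like / total for v, like in likelihoods.items()}
```

Because the category means and covariances are estimated directly from the production statistics, such a model has no free parameters when predicting listeners' categorization responses.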
Talkers vary in the phonetic realization of their vowels. One influential hypothesis holds that listeners overcome this inter-talker variability through pre-linguistic auditory mechanisms that normalize the acoustic or phonetic cues that form the input to speech recognition. Dozens of competing normalization accounts exist, including both accounts specific to vowel perception and general-purpose accounts that can be applied to any type of cue. We add to the cross-linguistic literature on this question by comparing normalization accounts against a new phonetically annotated vowel database of Swedish, a language with a particularly dense vowel inventory of 21 vowels differing in quality and quantity. We evaluate normalization accounts on the consequences they predict for perception. The results indicate that the best-performing accounts either center or standardize formants by talker. The study also suggests that general-purpose accounts perform as well as vowel-specific accounts, and that vowel normalization operates in both the temporal and spectral domains.
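For comparison with Lobanov-style standardization above, the following sketch shows simple centering of any cue at the talker's mean, a simplified stand-in for general-purpose, expectation-based normalization in the spirit of C-CuRE (McMurray & Jongman, 2011). The full account computes residuals against richer expectations than the talker mean; the function and its interface here are illustrative assumptions only.

```python
import numpy as np

def center_by_talker(cues, talker_ids):
    """Center each cue (formants, duration, etc.) at the talker's mean.

    cues:       (n_tokens, n_cues) array; applicable to any cue type.
    talker_ids: length-n_tokens array of talker labels.
    """
    cues = np.asarray(cues, dtype=float)
    talker_ids = np.asarray(talker_ids)
    centered = np.empty_like(cues)
    for talker in np.unique(talker_ids):
        rows = talker_ids == talker
        centered[rows] = cues[rows] - cues[rows].mean(axis=0)  # residual from talker mean
    return centered
```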