Spectral properties of earlier sounds (context) influence recognition of later sounds (target). Acoustic variability in context stimuli can disrupt this process. When mean fundamental frequencies (f0’s) of preceding context sentences were highly variable across trials, shifts in target vowel categorization [due to spectral contrast effects (SCEs)] were smaller than when sentence mean f0’s were less variable; when sentences were rearranged to exhibit high or low variability in mean first formant frequencies (F1) in a given block, SCE magnitudes were equivalent [Assgari, Theodore, and Stilp (2019) J. Acoust. Soc. Am. 145(3), 1443–1454]. However, since sentences were originally chosen based on variability in mean f0, stimuli underrepresented the extent to which mean F1 could vary. Here, target vowels (/ɪ/-/ɛ/) were categorized following context sentences that varied substantially in mean F1 (experiment 1) or mean F3 (experiment 2) with variability in mean f0 held constant. In experiment 1, SCE magnitudes were equivalent whether context sentences had high or low variability in mean F1; the same pattern was observed in experiment 2 for new sentences with high or low variability in mean F3. Variability in some acoustic properties (mean f0) can be more perceptually consequential than others (mean F1, mean F3), but these results may be task-dependent.
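To make the F1-variability manipulation concrete, here is a minimal sketch (in Python, with made-up acoustic measurements and an illustrative selection rule, not the stimulus-selection procedure used in the study) of how context-sentence blocks could differ widely in the spread of sentence-mean F1 while holding the spread of sentence-mean f0 roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sentence measurements for a pool of 200 context sentences:
# column 0 = sentence-mean f0 (Hz), column 1 = sentence-mean F1 (Hz).
pool = np.column_stack([
    rng.normal(110, 3, 200),      # mean f0: deliberately narrow spread
    rng.uniform(350, 800, 200),   # mean F1: wide range to select from
])

def block_sd(indices):
    """Across-sentence standard deviations of mean f0 and mean F1 for a block."""
    block = pool[indices]
    return block[:, 0].std(), block[:, 1].std()

# Low-F1-variability block: sentences whose mean F1 lies near the pool median.
by_distance_from_median = np.argsort(np.abs(pool[:, 1] - np.median(pool[:, 1])))
low_var_block = by_distance_from_median[:40]

# High-F1-variability block: sentences drawn from both extremes of the F1 range.
by_F1 = np.argsort(pool[:, 1])
high_var_block = np.concatenate([by_F1[:20], by_F1[-20:]])

# Mean-f0 spread is comparable across blocks; mean-F1 spread differs greatly.
print("low-variability block  (f0 SD, F1 SD):", block_sd(low_var_block))
print("high-variability block (f0 SD, F1 SD):", block_sd(high_var_block))
```

Because f0 and F1 are sampled independently here, both blocks inherit the same narrow f0 spread, which is the property the experiments held constant while F1 (or F3) variability was manipulated.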
When perceiving speech spoken by a single talker versus multiple talkers, listeners show perceptual benefits such as increased accuracy and/or decreased response time. There are several theoretical explanations for this talker adaptation phenomenon; one way to distinguish among these is to test whether adapting to stimulus structure is speech-specific or general to auditory perception. Music, like speech, is a sound class replete with acoustic variation. Here, participants completed a musical task that mirrored talker adaptation paradigms. On each trial, participants heard a tone and reported whether it was the lower-pitched (D4, 294 Hz) or higher-pitched tone (F#4, 370 Hz). Tones were presented in a single- or mixed-instrument block. We predicted that perceptual benefits from structure in the acoustic signal are not specific to speech but are a general auditory response. Accordingly, we hypothesized that participants would respond faster in the single-instrument block, consistent with speech studies that used a similar paradigm. Pitch judgments were faster (and more accurate) in the single-instrument block, paralleling results from talker adaptation studies. In agreement with general theoretical approaches to auditory perception, perceptual benefits from signal structure are not limited to speech.
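As an illustration of the block structure (not the authors' actual stimuli, which were instrument tones rather than sine waves; the instrument list, trial count, and sine-tone stand-ins below are assumptions), a trial generator for the single- versus mixed-instrument conditions might look like this:

```python
import random
import numpy as np

FS = 44100                                  # assumed sample rate (Hz)
TARGETS = {"D4": 294.0, "F#4": 370.0}       # lower and higher target pitches (from the abstract)
INSTRUMENTS = ["piano", "trumpet", "violin", "marimba"]   # hypothetical instrument set

def sine_tone(freq_hz, dur_s=0.5):
    """Sine-wave stand-in for an instrument note at the target pitch."""
    t = np.arange(int(FS * dur_s)) / FS
    return np.sin(2 * np.pi * freq_hz * t)

def make_block(mixed, n_trials=40):
    """Single-instrument block: one instrument throughout; mixed block: instrument varies by trial."""
    fixed_instrument = random.choice(INSTRUMENTS)
    trials = []
    for _ in range(n_trials):
        note = random.choice(list(TARGETS))          # listener judges: lower or higher?
        instrument = random.choice(INSTRUMENTS) if mixed else fixed_instrument
        trials.append({"note": note,
                       "instrument": instrument,
                       "audio": sine_tone(TARGETS[note])})
    return trials

single_block = make_block(mixed=False)   # analogous to a single-talker block
mixed_block = make_block(mixed=True)     # analogous to a mixed-talker block
```

The target pitch varies unpredictably from trial to trial in both conditions; only the consistency of the sound source (instrument, analogous to talker) distinguishes the blocks.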
Perception of speech sounds has a long history of being compared to perception of nonspeech sounds, with rich and enduring debates regarding how closely the two share underlying processes. In many instances, perception of nonspeech sounds is directly compared to that of speech sounds without a clear explanation of how related these sounds are to the speech they are selected to mirror (or not mirror). While the extreme acoustic variability of speech sounds is well documented, this variability is bounded by the common source of a human vocal tract. Nonspeech sounds do not share a common source and, as such, exhibit even greater acoustic variability than that observed for speech. This increased variability raises important questions about how well perception of a given nonspeech sound might resemble or model perception of speech sounds. Here, we offer a brief review of the extremely diverse nonspeech stimuli that have been used in efforts to better understand perception of speech sounds. The review is organized according to increasing spectrotemporal complexity: random noise, pure tones, multitone complexes, environmental sounds, music, speech excerpts that are not recognized as speech, and sinewave speech. Considerations are offered for stimulus selection in nonspeech perception experiments moving forward.
When speaking in noisy conditions or to a hearing-impaired listener, talkers often use clear speech, which is slower, louder, and hyperarticulated relative to conversational speech. In other research, changes in speaking rate are known to affect speech perception (called temporal contrast effects, or speaking rate normalization). For example, when a sentence is spoken quickly, the voice onset time (VOT) in the next word sounds longer by comparison (e.g., more like the /t/ of “tier”); a sentence spoken slowly makes the VOT in the next word sound shorter (e.g., more like the /d/ of “deer”). Typically, a single sentence is manipulated to produce fast and slow versions. We tested whether naturally produced clear and conversational speaking styles would also produce these temporal contrast effects. On each trial, listeners heard either a clear (slow) sentence or a conversational (fast) sentence followed by a target word to be categorized as “deer” or “tier.” Temporal contrast effects were observed both for conversational relative to clear speech and for conversational relative to a slowed version of the conversational speech. Changing speaking styles aids speech intelligibility but may produce other consequences, such as contrast effects that alter sound/word recognition.
Speech perception is shaped by acoustic properties of earlier sounds influencing recognition of later speech sounds. For example, when a context sentence is spoken at a faster rate, the following target word (varying from “deer” to “tier”) is perceived as “tier” (longer VOT) more often; when the context sentence is spoken at a slower rate, the following target word is perceived as “deer” (shorter VOT) more often. This is known as a temporal contrast effect (TCE, a.k.a. speaking rate normalization). Recently, Bosker, Sjerps, and Reinisch (2020, Scientific Reports) concluded that selective attention (to one of two simultaneous talkers) had no impact on TCEs. However, their paradigm was not an ideal test of this question: the two simultaneous sentences were spoken by different talkers and presented to opposite ears, making them relatively easy to separate perceptually. Here, on each trial, the same talker spoke one sentence to both ears (no segregation), two sentences simultaneously to both ears (poor segregation), or a different sentence to each ear (easier segregation). Fast or slow context sentences preceded target words varying from “deer” to “tier.” TCE magnitudes were similar across all presentation modes. Results are consistent with the claims set forth by Bosker et al.: TCEs are automatic and low-level, not modulated by selective attention.
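A schematic of the three presentation modes, assuming each context sentence is available as a mono NumPy array from the same talker (an illustrative sketch of the channel routing only, not the stimulus-construction code from the study):

```python
import numpy as np

def _pad_to(x, n):
    """Zero-pad a mono signal to length n samples."""
    out = np.zeros(n)
    out[:len(x)] = x
    return out

def diotic_single(sentence):
    """No segregation: the same sentence presented to both ears."""
    return np.column_stack([sentence, sentence])

def diotic_mixture(sentence_a, sentence_b):
    """Poor segregation: both sentences summed and presented to both ears."""
    n = max(len(sentence_a), len(sentence_b))
    mix = _pad_to(sentence_a, n) + _pad_to(sentence_b, n)
    return np.column_stack([mix, mix])

def dichotic(sentence_a, sentence_b):
    """Easier segregation: a different sentence presented to each ear."""
    n = max(len(sentence_a), len(sentence_b))
    return np.column_stack([_pad_to(sentence_a, n), _pad_to(sentence_b, n)])
```

Each function returns a two-column (left, right) array, making explicit how the same talker's sentences can be routed to yield no, poor, or easier perceptual segregation before the target word is presented.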