The most critical attribute of human language is its unbounded combinatorial nature: smaller elements can be combined into larger structures based on a grammatical system, resulting in a hierarchy of linguistic units, e.g., words, phrases, and sentences. Mentally parsing and representing such structures, however, poses challenges for speech comprehension. In speech, hierarchical linguistic structures do not have boundaries clearly defined by acoustic cues and must therefore be internally and incrementally constructed during comprehension. Here we demonstrate that during listening to connected speech, cortical activity of different time scales concurrently tracks the time course of abstract linguistic structures at different hierarchical levels, e.g. words, phrases, and sentences. Critically, the neural tracking of hierarchical linguistic structures is dissociated from the encoding of acoustic cues as well as from the predictability of incoming words. The results demonstrate that a hierarchy of neural processing timescales underlies grammar-based internal construction of hierarchical linguistic structure.
A visual scene is perceived in terms of visual objects. Similar ideas have been proposed for the analogous case of auditory scene analysis, although their hypothesized neural underpinnings have not yet been established. Here, we address this question by recording from subjects selectively listening to one of two competing speakers, either of different or the same sex, using magnetoencephalography. Individual neural representations are seen for the speech of the two speakers, with each being selectively phase locked to the rhythm of the corresponding speech stream and from which can be exclusively reconstructed the temporal envelope of that speech stream. The neural representation of the attended speech dominates responses (with latency near 100 ms) in posterior auditory cortex. Furthermore, when the intensity of the attended and background speakers is separately varied over an 8-dB range, the neural representation of the attended speech adapts only to the intensity of that speaker but not to the intensity of the background speaker, suggesting an object-level intensity gain control. In summary, these results indicate that concurrent auditory objects, even if spectrotemporally overlapping and not resolvable at the auditory periphery, are neurally encoded individually in auditory cortex and emerge as fundamental representational units for top-down attentional modulation and bottom-up neural adaptation.
spectrotemporal response function | reverse correlation | phase locking | selective attention
In a complex auditory scene, humans and other animal species can perceptually detect and recognize individual auditory objects (i.e., the sound arising from a single source), even if strongly overlapping acoustically with sounds from other sources.
To accomplish this remarkably difficult task, it has been hypothesized that the auditory system first decomposes the complex auditory scene into separate acoustic features and then binds the features, as appropriate, into auditory objects (1-4). The neural representations of auditory objects, each the collective representation of all the features belonging to the same auditory object, have been hypothesized to emerge in auditory cortex to become fundamental units for high-level cognitive processing (5-7). The process of parsing an auditory scene into auditory objects is computationally complex and cannot as yet be emulated by computer algorithms (8), but it occurs reliably, and often effortlessly, in the human auditory system. For example, in the classic "cocktail party problem," where multiple speakers are talking at the same time (9), human listeners can selectively attend to a chosen target speaker, even if the competing speakers are acoustically more salient (e.g., louder) or perceptually very similar (such as of the same sex) (10). To demonstrate an object-based neural representation that could subserve the robust perception of an auditory object, several key pieces of evidence are needed. The first is to demonstrate neural activity that exclusively represents a single auditory...
Summary: The ability to focus on and understand one talker in a noisy social environment is a critical social-cognitive capacity, whose underlying neuronal mechanisms are unclear. We investigated the manner in which speech streams are represented in brain activity and the way that selective attention governs the brain’s representation of speech using a ‘Cocktail Party’ Paradigm, coupled with direct recordings from the cortical surface in surgical epilepsy patients. We find that brain activity dynamically tracks speech streams using both low frequency phase and high frequency amplitude fluctuations, and that optimal encoding likely combines the two. In and near low level auditory cortices, attention ‘modulates’ the representation by enhancing cortical tracking of attended speech streams, but ignored speech remains represented. In higher order regions, the representation appears to become more ‘selective,’ in that there is no detectable tracking of ignored speech. This selectivity itself seems to sharpen as a sentence unfolds.
The cortical representation of the acoustic features of continuous speech is the foundation of speech perception. In this study, noninvasive magnetoencephalography (MEG) recordings are obtained from human subjects actively listening to spoken narratives, in both simple and cocktail party-like auditory scenes. By modeling how acoustic features of speech are encoded in ongoing MEG activity as a spectrotemporal response function, we demonstrate that the slow temporal modulations of speech in a broad spectral region are represented bilaterally in auditory cortex by a phase-locked temporal code. For speech presented monaurally to either ear, this phase-locked response is always more faithful in the right hemisphere, but with a shorter latency in the hemisphere contralateral to the stimulated ear. When different spoken narratives are presented to each ear simultaneously (dichotic listening), the resulting cortical neural activity precisely encodes the acoustic features of both of the spoken narratives, but slightly weakened and delayed compared with the monaural response. Critically, the early sensory response to the attended speech is considerably stronger than that to the unattended speech, demonstrating top-down attentional gain control. This attentional gain is substantial even during the subjects' very first exposure to the speech mixture and therefore largely independent of knowledge of the speech content. Together, these findings characterize how the spectrotemporal features of speech are encoded in human auditory cortex and establish a single-trial-based paradigm to study the neural basis underlying the cocktail party phenomenon.
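The response-function modeling mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a single-band (temporal rather than spectrotemporal) response function, a simulated envelope and response in place of speech and MEG data, and ridge-regularized least squares as the estimator. All names (`lagged_design`, `estimate_trf`, `ridge`) and parameter values are hypothetical.

```python
import numpy as np

def lagged_design(stimulus, n_lags):
    """Build a design matrix whose column k is the stimulus delayed by k samples."""
    n = len(stimulus)
    X = np.zeros((n, n_lags))
    for k in range(n_lags):
        X[k:, k] = stimulus[:n - k]
    return X

def estimate_trf(stimulus, response, n_lags, ridge=1.0):
    """Ridge-regularized least-squares estimate of a temporal response function."""
    X = lagged_design(stimulus, n_lags)
    XtX = X.T @ X + ridge * np.eye(n_lags)
    return np.linalg.solve(XtX, X.T @ response)

# Simulated data (assumption): a white-noise "envelope" sampled at 100 Hz and a
# noisy response that follows the envelope with a 100 ms delay (10 samples).
rng = np.random.default_rng(0)
fs = 100
n_lags = 30
envelope = rng.standard_normal(20 * fs)
true_trf = np.zeros(n_lags)
true_trf[10] = 1.0
response = lagged_design(envelope, n_lags) @ true_trf
response = response + 0.5 * rng.standard_normal(envelope.size)

trf = estimate_trf(envelope, response, n_lags)
peak_latency_ms = 1000 * np.argmax(trf) / fs
print(f"Estimated peak latency: {peak_latency_ms:.0f} ms")
```

In this toy setting the estimated function peaks at the simulated delay. A full spectrotemporal version would, under the same assumptions, use one set of lagged columns per frequency band of the stimulus.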
Speech and music have structured rhythms. Here we discuss a major acoustic correlate of spoken and musical rhythms, the slow (0.25-32 Hz) temporal modulations in sound intensity, and compare the modulation properties of speech and music. We analyze these modulations using over 25 h of speech and over 39 h of recordings of Western music. We show that the speech modulation spectrum is highly consistent across 9 languages (including languages with typologically different rhythmic characteristics). A different, but similarly consistent modulation spectrum is observed for music, including classical music played by single instruments of different types, symphonic, jazz, and rock. The temporal modulations of speech and music show broad but well-separated peaks around 5 and 2 Hz, respectively. These acoustically dominant time scales may be intrinsic features of speech and music, a possibility which should be investigated using more culturally diverse samples in each domain. Distinct modulation timescales for speech and music could facilitate their perceptual analysis and its neural processing.
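As a rough illustration of what a modulation spectrum measures, the sketch below computes the spectrum of a sound's intensity envelope within the slow modulation range named in the abstract. It is a simplification under stated assumptions, not the method of the study: the envelope is taken by plain rectification of a broadband signal, and the input is synthetic noise amplitude-modulated at 5 Hz rather than recorded speech or music. The function name `modulation_spectrum` and all parameter values are hypothetical.

```python
import numpy as np

def modulation_spectrum(signal, fs, fmin=0.25, fmax=32.0):
    """Magnitude spectrum of the rectified envelope, restricted to fmin..fmax Hz."""
    envelope = np.abs(signal)               # crude intensity envelope (assumption)
    envelope = envelope - envelope.mean()   # remove the DC component
    spectrum = np.abs(np.fft.rfft(envelope))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band], spectrum[band]

# Synthetic test sound (assumption): 10 s of noise modulated at 5 Hz.
fs = 1000
t = np.arange(10 * fs) / fs
rng = np.random.default_rng(0)
carrier = rng.standard_normal(t.size)
sound = (1.0 + 0.8 * np.sin(2 * np.pi * 5.0 * t)) * carrier

freqs, spec = modulation_spectrum(sound, fs)
peak_hz = freqs[np.argmax(spec)]
print(f"Dominant modulation rate: {peak_hz:.1f} Hz")
```

With real recordings, one would typically average such spectra over many excerpts before comparing peak locations across speech and music.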