Sounds in everyday life seldom appear in isolation. Both humans and machines are constantly flooded with a cacophony of sounds that must be sorted through and scoured for relevant information, a phenomenon referred to as the ‘cocktail party problem’. A key component in parsing acoustic scenes is attention, which mediates perception and behaviour by focusing both sensory and cognitive resources on pertinent information in the stimulus space. The current article reviews modelling studies of auditory attention. The review highlights how the term attention refers to a multitude of behavioural and cognitive processes that can shape sensory processing. Attention can be modulated by ‘bottom-up’ sensory-driven factors, as well as by ‘top-down’ task-specific goals, expectations and learned schemas. Essentially, it acts as a selection process, or processes, that focuses both sensory and cognitive resources on the most relevant events in the soundscape, with relevance dictated either by the stimulus itself (e.g. a loud explosion) or by the task at hand (e.g. listening for announcements in a busy airport). Recent computational models of auditory attention provide key insights into its role in facilitating perception in cluttered auditory scenes. This article is part of the themed issue ‘Auditory and visual scene analysis’.
Bottom-up attention is a sensory-driven selection mechanism that directs perception toward a subset of the stimulus that is considered salient, or attention-grabbing. Most studies of bottom-up auditory attention have adopted frameworks similar to visual attention models, whereby local or global “contrast” is the central concept for defining salient elements in a scene. In the current study, we take a more fundamental approach to modeling auditory attention: providing the first examination of the space of auditory saliency spanning pitch, intensity and timbre, and shedding light on complex interactions among these features. Informed by psychoacoustic results, we develop a computational model of auditory saliency that implements a novel attentional framework, guided by processes hypothesized to take place along the auditory pathway. In particular, the model tests the hypothesis that perception tracks the evolution of sound events in a multidimensional feature space and flags any deviation from background statistics as salient. Predictions from the model corroborate the relationship between bottom-up auditory attention and statistical inference, and argue for a potential role of predictive coding as a mechanism for saliency detection in acoustic scenes.
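The abstract does not spell out the model's computations; as a minimal sketch of the central idea it describes, flagging events whose features deviate from the running background statistics, the snippet below tracks the mean and variance of each feature dimension over a sliding window and marks a frame as salient when it deviates beyond a z-score threshold. The function name saliency_from_statistics and the parameters window and z_thresh are illustrative assumptions, not part of the published model.

```python
import numpy as np

def saliency_from_statistics(features, window=50, z_thresh=4.0):
    """Flag salient frames as statistical outliers against the running
    background distribution of each feature dimension.

    features : array of shape (n_frames, n_dims), e.g. per-frame
               pitch, intensity and timbre estimates (hypothetical).
    window   : number of past frames defining the 'background'.
    z_thresh : deviation (in standard deviations) treated as salient.
    """
    n_frames, n_dims = features.shape
    salient = np.zeros(n_frames, dtype=bool)
    for t in range(window, n_frames):
        background = features[t - window:t]       # past context only
        mu = background.mean(axis=0)              # background mean per dim
        sigma = background.std(axis=0) + 1e-8     # avoid divide-by-zero
        z = np.abs(features[t] - mu) / sigma      # per-dimension z-score
        salient[t] = np.any(z > z_thresh)         # deviation in any dim
    return salient

# Example: steady background with a sudden jump in one dimension at frame 120.
rng = np.random.default_rng(0)
feats = rng.normal(0.0, 1.0, size=(200, 3))
feats[120, 1] += 10.0                             # simulated deviant
print(np.where(saliency_from_statistics(feats))[0])   # expect [120]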
A key component in computational analysis of the auditory environment is the detection of novel sounds in the scene. Deviance detection aids in the segmentation of auditory objects and is also the basis of bottom-up auditory saliency, which is crucial in directing attention to relevant events. There is growing evidence that deviance detection is implemented in the brain through mapping of the temporal regularities in the acoustic scene. The violation of these regularities is reflected as the mismatch negativity (MMN), a signature electrical response observed using electroencephalography (EEG) or magnetoencephalography (MEG). While numerous experimental results have quantified the properties of this MMN response, there have been few attempts at developing general computational frameworks of MMN that can be integrated into comprehensive models of scene analysis. In this work, we interpret the mechanism underlying the MMN response as a Kalman-filter formulation that provides a recursive prediction of sound features based on past sensory information, eliciting an MMN when predictions are violated. The model operates in a high-dimensional space, mimicking the rich set of features that underlie sound encoding up to the level of the auditory cortex. We test the proposed scheme on a variety of simple oddball paradigms adapted to various features of sounds: pitch, intensity, direction and inter-stimulus interval. Our model successfully finds the deviant onset times when the deviant differs from the standard in one or more of the computed dimensions. Our results not only lay a foundation for modeling more complex elicitations of MMN, but also provide a versatile and robust mechanism for outlier detection in temporal signals and, ultimately, for parsing of auditory scenes.
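As a rough illustration of this Kalman-filter interpretation, the sketch below runs a scalar filter over a single feature (e.g. the pitch of each tone in an oddball sequence) and treats a large normalized innovation, i.e. a prediction error that is surprising relative to its expected variance, as an MMN-like event. The noise variances q and r, the threshold mmn_thresh, and the choice to skip the state update on flagged deviants are illustrative assumptions; the published model operates over a high-dimensional feature space rather than a single scalar.

```python
import numpy as np

def kalman_mmn(observations, q=1e-4, r=1e-2, mmn_thresh=3.0):
    """Recursively predict the next feature value with a scalar Kalman
    filter; flag an MMN-like event when the innovation (prediction
    error) is large relative to its expected standard deviation.

    observations : 1-D sequence of a sound feature, one value per tone.
    q, r         : assumed process and observation noise variances.
    mmn_thresh   : normalized-innovation threshold for flagging a deviant.
    """
    x, p = observations[0], 1.0        # initial state estimate and variance
    deviants = []
    for t, y in enumerate(observations[1:], start=1):
        p_pred = p + q                 # predict: variance grows by process noise
        s = p_pred + r                 # innovation variance
        innovation = y - x             # prediction error for this tone
        if abs(innovation) / np.sqrt(s) > mmn_thresh:
            deviants.append(t)         # regularity violated -> MMN-like event
            p = p_pred                 # treat deviant as outlier; keep estimate
        else:
            k = p_pred / s             # Kalman gain
            x = x + k * innovation     # update state with new observation
            p = (1 - k) * p_pred       # update uncertainty
    return deviants

# Oddball paradigm: repeated standard tone with occasional pitch deviants.
tones = np.full(40, 440.0)
tones[[10, 25, 33]] = 494.0            # deviant tones
print(kalman_mmn(tones))               # expect [10, 25, 33]
```

Skipping the state update on flagged tones keeps the filter locked onto the standard, so the return to the standard after a deviant is not itself flagged; a multidimensional version would replace the scalar state with a feature vector and the variances with covariance matrices.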