When multiple people talk simultaneously, the healthy human auditory system is able to attend to one particular speaker of interest. Recently, it has been demonstrated that it is possible to infer which speaker someone is attending to by relating neural activity, recorded by electroencephalography (EEG), to the speech signals. This is relevant for effective noise suppression in hearing devices, which must detect the target speaker in a multi-speaker scenario. Most auditory attention detection algorithms use a linear EEG decoder to reconstruct the attended stimulus envelope, which is then compared to the original stimulus envelopes to determine the attended speaker. Classifying attention within a short time interval remains the main challenge. We present two convolutional neural network (CNN)-based approaches to address this problem. One selects the attended speaker from a given set of individual speaker envelopes; the other determines the locus of auditory attention (left or right) without knowledge of the speech envelopes. Our results show that it is possible to decode attention within 1-2 seconds, with a median accuracy of around 80%, without access to the speech envelopes. This is promising for neuro-steered noise suppression in hearing aids, which requires fast and accurate attention detection. Furthermore, the ability to detect the locus of auditory attention without access to the speech envelopes is promising for scenarios in which per-speaker envelopes are unavailable. It also enables a fast and objective attention measure for future studies.
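To make the linear-decoder baseline mentioned above concrete, the following is a minimal sketch of correlation-based attended-speaker selection: a pre-trained linear decoder maps time-lagged EEG to a reconstructed speech envelope, and the candidate speaker whose envelope correlates best with that reconstruction is labeled as attended. The window length, lag range, channel count, and decoder weights in this sketch are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np

def build_lagged_eeg(eeg, num_lags):
    """Stack time-lagged copies of the EEG channels (samples x channels)."""
    n_samples, n_channels = eeg.shape
    lagged = np.zeros((n_samples, n_channels * num_lags))
    for lag in range(num_lags):
        lagged[lag:, lag * n_channels:(lag + 1) * n_channels] = eeg[:n_samples - lag]
    return lagged

def decode_attended_speaker(eeg, envelopes, decoder, num_lags=16):
    """Return the index of the candidate envelope best matching the decoder output.

    eeg       : (samples, channels) EEG segment, e.g. a short decision window
    envelopes : list of (samples,) candidate speech envelopes
    decoder   : (channels * num_lags,) pre-trained linear decoder weights
    """
    reconstruction = build_lagged_eeg(eeg, num_lags) @ decoder
    correlations = [np.corrcoef(reconstruction, env)[0, 1] for env in envelopes]
    return int(np.argmax(correlations)), correlations

# Hypothetical usage: a 1-second window at 64 Hz with 64 channels and two speakers.
rng = np.random.default_rng(0)
eeg_window = rng.standard_normal((64, 64))
candidate_envelopes = [rng.standard_normal(64), rng.standard_normal(64)]
trained_decoder = rng.standard_normal(64 * 16)  # placeholder for trained weights
attended, scores = decode_attended_speaker(eeg_window, candidate_envelopes, trained_decoder)
print(attended, scores)
```

On such short windows the correlations are noisy, which is why this baseline degrades at 1-2 second decision lengths and why the CNN-based approaches presented here are of interest.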
Index Terms: Convolutional neural networks (CNN), auditory attention detection (AAD), electroencephalography (EEG), neuro-steered auditory prosthesis, brain-computer interface (BCI)