A Framework for Speech Activity Detection Using Adaptive Auditory Receptive Fields

Carlin, Michael A.; Elhilali, Mounya

doi:10.1109/taslp.2015.2481179

Cited by 6 publications

(2 citation statements)

References 97 publications

“…Top-down task-based adaptations have been incorporated in attention systems by modelling the attentional gain as weights in the classifier to optimize performance based on specific task goals [77,78], or as a separate cognitive model deciding which speaker to attend to among competing sources [79]. A more holistic attention mechanism has instead used the goal-directed adaptation framework of physiological STRFs as a pre-processing stage to speech recognition, by enabling the separation of the target speech stream from the distractor soundscape it is embedded in [80]. The attentional filter provides significant gain to the target speech while being robust to previously unseen noise types.…”

Section: Applications Of Auditory Attention Models (mentioning)

confidence: 99%

Modelling auditory attention

Kaya¹,

Elhilali²

2017

Phil. Trans. R. Soc. B

Self Cite

109

101

Sounds in everyday life seldom appear in isolation. Both humans and machines are constantly flooded with a cacophony of sounds that need to be sorted through and scoured for relevant information, a phenomenon referred to as the ‘cocktail party problem’. A key component in parsing acoustic scenes is the role of attention, which mediates perception and behaviour by focusing both sensory and cognitive resources on pertinent information in the stimulus space. The current article provides a review of modelling studies of auditory attention. The review highlights how the term attention refers to a multitude of behavioural and cognitive processes that can shape sensory processing. Attention can be modulated by ‘bottom-up’ sensory-driven factors, as well as ‘top-down’ task-specific goals, expectations and learned schemas. Essentially, it acts as a selection process or processes that focus both sensory and cognitive resources on the most relevant events in the soundscape, with relevance being dictated by the stimulus itself (e.g. a loud explosion) or by a task at hand (e.g. listen to announcements in a busy airport). Recent computational models of auditory attention provide key insights into its role in facilitating perception in cluttered auditory scenes. This article is part of the themed issue ‘Auditory and visual scene analysis’.

“…Several neurophysiological studies have focused on understanding the ability of humans and animals to tune their cortical Spectro-Temporal Receptive Fields (STRFs) in order to selectively focus on target sounds, while minimizing the irrelevant acoustics and noise background [25]-[28]. Building on such studies, Carlin and Elhilali [25] trained a Gaussian Mixture Model on features obtained from both the initial and adapted STRFs. They showed that an ensemble of adapted STRFs achieves better performance in detecting speech in the presence of noise.…”

Section: Related Work (mentioning)

confidence: 99%

Receptive Field Regularization Techniques for Audio Classification and Tagging with Deep Convolutional Neural Networks

Koutini,

Eghbal-zadeh,

Widmer

2021

Preprint

In this paper, we study the performance of variants of well-known Convolutional Neural Network (CNN) architectures on different audio tasks. We show that tuning the Receptive Field (RF) of CNNs is crucial to their generalization. An insufficient RF limits the CNN's ability to fit the training data. In contrast, CNNs with an excessive RF tend to over-fit the training data and fail to generalize to unseen testing data. As state-of-the-art CNN architectures, in computer vision and other domains, tend to go deeper in terms of number of layers, their RF size increases and therefore they degrade in performance in several audio classification and tagging tasks. We study well-known CNN architectures and how their building blocks affect their receptive field. We propose several systematic approaches to control the RF of CNNs and systematically test the resulting architectures on different audio classification and tagging tasks and datasets. The experiments show that regularizing the RF of CNNs using our proposed approaches can drastically improve the generalization of models, out-performing complex architectures and pre-trained models on larger datasets. The proposed CNNs achieve state-of-the-art results in multiple tasks, from acoustic scene classification to emotion and theme detection in music to instrument recognition, as demonstrated by top ranks in several pertinent challenges (DCASE, MediaEval).
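
The abstract's point that deeper stacks have larger receptive fields can be illustrated with the standard recurrence for a chain of convolution or pooling layers: each layer adds (kernel size - 1) times the accumulated stride to the RF. The sketch below is not taken from the paper. The function name `receptive_field` and the layer stacks are hypothetical, and it assumes one axis, no dilation, and ignores padding effects at the borders.

```python
def receptive_field(layers):
    """Receptive field size, in input samples along one axis, of a
    sequential stack of convolution/pooling layers (no dilation).

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf = 1    # a single input sample covers only itself
    jump = 1  # spacing, in input samples, between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # growth contributed by this layer
        jump *= stride                  # strides compound through the stack
    return rf


# Hypothetical stacks: size-3 kernels, with a stride-2 layer every third layer.
shallow = [(3, 1), (3, 1), (3, 2)]
deep = shallow * 4  # same block repeated four times

print(receptive_field(shallow))
print(receptive_field(deep))
```

Repeating the same block four times grows the RF from 7 to 91 input samples here, because each stride-2 layer doubles the contribution of every later kernel. Under these assumptions, limiting depth, kernel size, or stride are the direct levers for keeping the RF small.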
