<p>This manuscript addresses the problem of detecting, classifying, and localizing sound sources in spatial audio recordings of an acoustic scene. We propose using bio-inspired Gammatone auditory filters for the acoustic feature extraction stage, together with a novel deep learning architecture comprising convolutional, recurrent, and temporal convolutional blocks. Our system exceeds state-of-the-art metrics on four spatial audio datasets with different levels of acoustic complexity and up to three sound sources overlapping in time. Furthermore, we perform a comparative analysis of the gap between machine and human hearing, showing that our results already surpass human performance in non-reverberant scenarios. </p>
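<p>To make the described pipeline concrete, the sketch below outlines a minimal SELD-style network that chains convolutional, recurrent, and temporal convolutional blocks over Gammatone-band input features, as the abstract describes. All layer sizes, block counts, and the two output heads (event activity and direction-of-arrival regression) are illustrative assumptions, not the authors' exact configuration.</p>
<pre><code>
# Hedged sketch (PyTorch): CNN + GRU + dilated temporal-convolution blocks
# operating on Gammatone-band features. Hyperparameters are assumptions.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """2-D convolution over (time, frequency) feature maps with frequency pooling."""
    def __init__(self, in_ch, out_ch, freq_pool):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d((1, freq_pool)),  # pool frequency, keep time resolution
        )

    def forward(self, x):
        return self.block(x)


class TemporalConvBlock(nn.Module):
    """Dilated 1-D convolution over the time axis with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.conv(x)


class SELDNet(nn.Module):
    def __init__(self, n_channels=4, n_gammatone_bands=64,
                 n_classes=14, rnn_hidden=128):
        super().__init__()
        # CNN front end on Gammatone-band features (mic channels, time, bands).
        self.cnn = nn.Sequential(
            ConvBlock(n_channels, 64, freq_pool=4),
            ConvBlock(64, 64, freq_pool=4),
            ConvBlock(64, 64, freq_pool=2),
        )
        feat_dim = 64 * (n_gammatone_bands // (4 * 4 * 2))
        self.rnn = nn.GRU(feat_dim, rnn_hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.tcn = nn.Sequential(
            TemporalConvBlock(2 * rnn_hidden, dilation=1),
            TemporalConvBlock(2 * rnn_hidden, dilation=2),
            TemporalConvBlock(2 * rnn_hidden, dilation=4),
        )
        # Two heads: sound event detection and (x, y, z) direction of arrival per class.
        self.sed_head = nn.Linear(2 * rnn_hidden, n_classes)
        self.doa_head = nn.Linear(2 * rnn_hidden, 3 * n_classes)

    def forward(self, x):
        # x: (batch, mic_channels, time_frames, gammatone_bands)
        x = self.cnn(x)                        # (batch, 64, time, reduced_bands)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)                     # (batch, time, 2 * rnn_hidden)
        x = self.tcn(x.transpose(1, 2)).transpose(1, 2)
        return torch.sigmoid(self.sed_head(x)), torch.tanh(self.doa_head(x))


# Example: four-channel input, 128 time frames, 64 Gammatone bands.
model = SELDNet()
sed, doa = model(torch.randn(2, 4, 128, 64))
print(sed.shape, doa.shape)  # torch.Size([2, 128, 14]) torch.Size([2, 128, 42])
</code></pre>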