Summary
Parallels have been reported between the broad organization of the auditory system and optimized artificial neural networks1–3. It remains to be seen whether such promising analogies between the auditory system and deep learning models endure at other levels of description. Here, we examined whether artificial neural networks4,5 could offer a mechanistic account of human behavior in an auditory task. The chosen task promoted the use of binaural cues (timing differences across the ears) to help detect a signal in noise6,7. In the optimal network, we observed the emergence of specialized computations with prominent similarities to in vivo animal data8. Artificial neurons developed a sensitivity to temporal delays that increased hierarchically, and their delay preferences were widely distributed (extending to delays beyond the range permitted by head width). The ensuing dynamics were consistent with a binaural cross-correlation mechanism9. Given that the neural mechanisms of binaural detection in humans are contested9–13, these findings help to resolve the debate. Moreover, this work is an early demonstration that deep learning can infer tangible mechanisms underlying auditory perception.
The binaural system uses interaural timing cues to improve the detection of auditory signals presented in noise. In humans, the binaural mechanisms underlying this phenomenon cannot be directly measured and hence remain contentious. As an alternative, we trained modified autoencoder networks to mimic human-like behavior in a binaural detection task. The autoencoder architecture emphasizes interpretability; hence, we “opened it up” to see whether it could infer latent mechanisms underlying binaural detection. We found that the optimal networks automatically developed artificial neurons that were sensitive to timing cues and whose dynamics were consistent with a cross-correlation mechanism. These computations were similar to neural dynamics reported in animal models. That they emerged to account for human hearing attests to their generality as a solution for binaural signal detection. This study demonstrates the utility of explanation-driven neural network models and how they can be used to infer mechanisms of audition.
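To make the cross-correlation mechanism concrete, below is a minimal, illustrative Python sketch (not drawn from the study itself) of how cross-correlating left- and right-ear inputs over a bank of candidate delays can recover an interaural time difference (ITD) from noise. The sample rate, noise levels, and the 300 µs delay are assumptions chosen for demonstration.

```python
# A minimal sketch of a binaural cross-correlation mechanism: estimate the
# interaural time difference (ITD) of a delayed signal in noise by
# cross-correlating the two ear inputs over candidate internal delays.
# All signal parameters are illustrative assumptions, not the study's values.
import numpy as np

fs = 44_100                       # sample rate in Hz (assumed)
true_itd = 300e-6                 # 300 microsecond delay, within head-width range
delay = int(round(true_itd * fs))

rng = np.random.default_rng(0)
signal = rng.standard_normal(fs // 10)               # 100 ms noise token
left = signal + 0.5 * rng.standard_normal(signal.size)
right = np.roll(signal, delay) + 0.5 * rng.standard_normal(signal.size)

# Correlate over a bank of candidate lags (internal "delay lines").
max_lag = int(1e-3 * fs)                             # search +/- 1 ms
lags = np.arange(-max_lag, max_lag + 1)
corr = np.array([np.dot(left, np.roll(right, -lag)) for lag in lags])

# The lag with the peak correlation is the estimated ITD.
est_itd = lags[np.argmax(corr)] / fs
print(f"estimated ITD: {est_itd * 1e6:.0f} microseconds")
```

Each candidate lag plays the role of an internal delay line; the lag at which the correlation peaks serves as the estimate of the ITD, which is the essence of the cross-correlation account referenced above.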
Summary
When presented with two vowels simultaneously, humans are often able to identify the constituent vowels. Computational models exist that simulate this ability; however, they predict listener confusions poorly, particularly when the two vowels share the same fundamental frequency. Presented here is a model that is uniquely able to predict the combined representation of concurrent vowels, and it predicts listeners’ systematic perceptual decisions with a high degree of accuracy.