2020
DOI: 10.1111/cogs.12823

EARSHOT: A Minimal Neural Network Model of Incremental Human Speech Recognition

Abstract: Despite the lack of invariance problem (the many-to-many mapping between acoustics and percepts), human listeners experience phonetic constancy and typically perceive what a speaker intends. Most models of human speech recognition (HSR) have side-stepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, carefully engineered deep learning networks allow robust, real-world automatic speech recognition (ASR). However, the complexities of de…
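The abstract is truncated above, but the kind of architecture it points to (real speech in, graded word-level activation out, with no hand-built phoneme layer) can be sketched in a few lines. The sketch below is only a hedged illustration: the input encoding, layer sizes, and the distributed "pseudo-semantic" output are assumptions made for the example, not the exact settings reported by Magnuson et al. (2020).

```python
# A minimal sketch of an EARSHOT-style network, assuming: spectral frames as
# input, a single recurrent hidden layer, and a distributed "pseudo-semantic"
# output vector per word. Layer sizes and details are illustrative only.
import torch
import torch.nn as nn

class EarshotLikeNet(nn.Module):
    def __init__(self, n_spectral=256, n_hidden=512, n_semantic=300):
        super().__init__()
        # One recurrent layer maps the incoming spectral frames to a running
        # hidden state; no intermediate phoneme layer is built in by hand.
        self.rnn = nn.LSTM(input_size=n_spectral, hidden_size=n_hidden,
                           batch_first=True)
        # Readout from the hidden state to a distributed semantic target.
        self.readout = nn.Linear(n_hidden, n_semantic)

    def forward(self, spectral_frames):
        # spectral_frames: (batch, time, n_spectral), e.g. one frame per 10 ms.
        hidden_states, _ = self.rnn(spectral_frames)
        # Sigmoid outputs give graded, frame-by-frame activation of each
        # semantic unit, so word activation can be tracked incrementally.
        return torch.sigmoid(self.readout(hidden_states))

# Toy usage: one second of input as 100 frames of 256 spectral channels.
net = EarshotLikeNet()
frames = torch.randn(1, 100, 256)
print(net(frames).shape)  # torch.Size([1, 100, 300])
```

Training such a network would typically compare the frame-wise output to the target word's semantic vector (e.g., with a cross-entropy loss), which is consistent with the incremental, graded recognition the paper emphasizes; again, treat these specifics as assumptions rather than the authors' exact method.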

Cited by 61 publications (61 citation statements)
References 25 publications
“…This result clearly demonstrates that human nonprimary auditory cortex maintains a robust, graded representation of VOT that includes the sub-phonetic details about how a particular speech token was pronounced (Blumstein et al, 2005; Toscano et al, 2010; Toscano et al, 2018; Frye et al, 2007). Even though sub-phonetic information is not strictly necessary for mapping sound to meaning in stable, noise-free listening environments, this fine-grained acoustic detail has demonstrable effects on listeners’ behavior (Kuhl, 1991; Carney, 1977; Pisoni and Tash, 1974; Massaro and Cohen, 1983; Andruski et al, 1994; McMurray et al, 2002; Schouten et al, 2003), and modern theories of speech perception agree that perceptual learning (e.g., adaptation to accented speakers) and robust cue integration would be impossible if the perception of speech sounds were strictly categorical (Miller and Volaitis, 1989; Clayards et al, 2008; Kleinschmidt and Jaeger, 2015; McMurray and Jongman, 2011; Toscano and McMurray, 2010; McClelland and Elman, 1986; Norris and McQueen, 2008; Norris et al, 2016; Magnuson et al, 2020). Crucially, these data suggest that the same spatial/amplitude code that is implicated in the representation of phonetic information (from spectral or temporal cues) can also accommodate the representation of sub-phonetic information in the speech signal.…”
Section: Discussion
confidence: 99%
“…Secondly, in distributed approaches, adopted by models such as Distributed Cohort Model and Earshot (Magnuson et al, 2018), a word's meaning is represented by a numeric vector specifying the coordinates of that word in a high-dimensional semantic space. The status of phone units within these approaches is under debate.…”
Section: Architectures For Spoken Word Recognition
confidence: 99%
“…The Distributed Cohort Model argues that distributed recurrent networks obviate the need for intermediate phone representations, and hence this model does not make any attempt to link patterns of activation on the hidden recurrent layer of the model to abstract phones. By contrast, the deep learning model of Magnuson et al (2018) explicitly interprets the units on its hidden layer as the fuzzy equivalents in the brain of the discrete phones of traditional linguistics.…”
Section: Architectures For Spoken Word Recognition
confidence: 99%
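The contrast drawn in the excerpt above, between discarding phone units altogether and reading hidden units as "fuzzy" equivalents of phones, suggests a simple diagnostic. The sketch below is a hedged illustration rather than the analysis actually reported by Magnuson et al.: it assumes frame-level phone labels are available and asks how selectively each hidden unit responds to each phone.

```python
# A minimal sketch, assuming frame-level phone labels exist for the stimuli:
# average each hidden unit's activation over all frames carrying a given
# phone label and inspect the resulting selectivity profile. A unit whose
# profile peaks sharply for one phone behaves like a "fuzzy" phone detector;
# flatter, graded profiles indicate more distributed coding.
import numpy as np

def phone_selectivity(hidden_acts, frame_phones):
    """hidden_acts: (n_frames, n_hidden) recurrent-layer activations.
    frame_phones: length-n_frames sequence of phone labels ('b', 'a', ...).
    Returns {phone: mean activation vector over hidden units}."""
    labels = np.asarray(frame_phones)
    return {phone: hidden_acts[labels == phone].mean(axis=0)
            for phone in sorted(set(frame_phones))}

# Toy usage with random activations and labels (placeholders for real data).
rng = np.random.default_rng(0)
acts = rng.random((1000, 512))
labels = rng.choice(list("bdgptkaiu"), size=1000)
profiles = phone_selectivity(acts, labels)
print({p: v[:3].round(3) for p, v in list(profiles.items())[:2]})
```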
“…The Discriminative Lexicon is a comprehensive theory of the mental lexicon (Baayen et al 2018; 2019b; Chuang et al 2020b; c) that brings together several strands from independent theories: with word and paradigm theory it shares the hypothesis that words, not morphemes, stems or exponents are the relevant cognitive units (Blevins 2006; 2016a); with distributional semantics it shares the hypothesis that words get their meaning in utterances (Firth 1957; Landauer & Dumais 1997; Sahlgren 2008; Weaver 1955); from error-driven learning it implements the hypothesis that learning is the result of minimizing prediction errors (Rescorla & Wagner 1972; Widrow & Hoff 1960). From machine learning it incorporates the insight that fully connected neural networks are very successful at language learning (Boersma et al 2020; Magnuson et al 2020; Malouf 2017; Pater 2019; Prickett et al 2018).…”
Section: Introduction
confidence: 99%
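Since the excerpt above characterizes error-driven learning as the minimization of prediction errors (Rescorla & Wagner 1972; Widrow & Hoff 1960), a compact worked example may help. The sketch below implements a generic delta-rule update of cue-to-outcome weights; the cue/outcome setup and the learning rate are illustrative assumptions, not details from the cited work.

```python
# A minimal sketch of an error-driven (delta-rule) update in the
# Widrow-Hoff / Rescorla-Wagner family: cue-to-outcome association weights
# are nudged in proportion to the prediction error on each learning event.
import numpy as np

def delta_rule_update(weights, cues, outcomes, rate=0.01):
    """weights: (n_cues, n_outcomes); cues, outcomes: indicator vectors."""
    prediction = cues @ weights              # predicted outcome activations
    error = outcomes - prediction            # prediction error to be minimized
    weights += rate * np.outer(cues, error)  # strengthen under-predicting cues
    return weights

# Toy usage: two cues always co-occur with one outcome; their associations
# grow until the outcome is fully predicted and the error vanishes.
W = np.zeros((2, 1))
for _ in range(500):
    W = delta_rule_update(W, cues=np.array([1.0, 1.0]), outcomes=np.array([1.0]))
print(W.round(2))  # both weights converge toward 0.5, summing to ~1
```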