The Psychometrics of Automatic Speech Recognition (Preprint, 2021)
DOI: 10.1101/2021.04.19.440438

Abstract: Automatic speech recognition (ASR) software has been suggested as a candidate model of the human auditory system thanks to dramatic improvements in performance in recent years. To test this hypothesis, we compared several state-of-the-art ASR systems to results from humans on a barrage of standard psychoacoustic experiments. While some systems showed qualitative agreement with humans in some tests, in others all tested systems diverged markedly from humans. In particular, none of the models used spectral invar…

Cited by 25 publications (15 citation statements) · References: 53 publications (135 reference statements)
“…Our work and recent independent efforts (Weerts et al, 2021) suggest that, despite some predictive accuracy in neuroimaging studies (Millet & King, 2021; Kell, Yamins, Shook, Norman-Haignere, & McDermott, 2018; but see Thompson, Bengio, & Schoenwiesner, 2019), automatic speech recognition systems and humans diverge substantially in various perceptual domains. Our results further suggest that, far from being simply quantitative (e.g., receptive field sizes), these shortcomings are likely qualitative (e.g., lack of flexibility in task performance through exploiting alternative spectrotemporal scales) and would not be solved by such strategies as introducing different training regimens or increasing the models' capacity.…”
Citation type: mentioning · Confidence: 55%
“…These comprise the masking and silencing manipulations (Miller & Licklider, 1950), where the performance profiles vary more widely. Although there might be a way to reconcile these diverse performance patterns by altering minor parameters in the architectures, our work together with a parallel effort using different methods (Weerts, Rosen, Clopath, & Goodman, 2021) highlights a more fundamental difficulty these architectures have in performing well in the presence of noise. Furthermore, Weerts et al. (2021) find that systems displayed similarities and differences in terms of what features they are tuned to (e.g., spectral vs. temporal modulations, and the use of temporal fine structure).…”
Citation type: mentioning · Confidence: 93%
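The masking and silencing manipulations cited above (Miller & Licklider, 1950) periodically interrupt speech with silence or noise and measure how intelligibility degrades. A minimal sketch of the silencing variant, assuming a mono NumPy waveform and an illustrative interruption rate and duty cycle (not the parameters used in the cited studies):

```python
import numpy as np

def silence_interrupt(waveform, sample_rate, rate_hz=10.0, duty=0.5):
    """Periodically silence a waveform, Miller & Licklider-style.

    rate_hz: interruption cycles per second (illustrative value).
    duty:    fraction of each cycle during which the speech is kept.
    """
    t = np.arange(len(waveform)) / sample_rate
    phase = (t * rate_hz) % 1.0          # position within the current cycle
    gate = (phase < duty).astype(waveform.dtype)
    return waveform * gate

# Usage sketch: interrupt a 16 kHz utterance ten times per second.
# interrupted = silence_interrupt(speech, 16000, rate_hz=10.0, duty=0.5)
```

The masking variant would fill the gaps with noise instead of zeros, e.g., by adding gated noise wherever the gate is off.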
“…As a supervised reference system, we test a trained DeepSpeech model (Amodei et al., 2016). This model is not too intensive to train, is known to obtain reasonable ASR results, and has previously been compared to human speech perception (Millet and Dunbar, 2020b; Weerts et al., 2021). We train it to generate phonemic transcriptions.…”
Section: Supervised Reference: DeepSpeech · Citation type: mentioning · Confidence: 99%
“…Feather et al. (2019) used metamers as a tool to compare deep neural networks with humans. In a comparison between three speech recognition models, including a fine-tuned wav2vec 2.0 model, Weerts et al. (2021) showed that wav2vec 2.0 was the best at matching human low-level psycho-acoustic behaviour. However, the model exhibited clear differences with respect to humans, showing, for example, heightened sensitivity to band-pass filtering and an under-reliance on temporal fine structure.…”
Section: Introduction · Citation type: mentioning · Confidence: 99%
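The band-pass sensitivity result described above is the kind of probe that is straightforward to reproduce with public checkpoints. A minimal sketch, assuming a 16 kHz mono recording, the facebook/wav2vec2-base-960h model from Hugging Face transformers, and illustrative cut-off frequencies (not the exact conditions of the paper):

```python
import numpy as np
import torch
from scipy.signal import butter, sosfiltfilt
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

def bandpass(waveform, sample_rate, low_hz, high_hz, order=6):
    """Zero-phase band-pass filter for a mono waveform (illustrative cut-offs)."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, waveform).astype(np.float32)

def transcribe(waveform, sample_rate, processor, model):
    """Greedy CTC decoding of a single utterance."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

speech = np.random.randn(16000).astype(np.float32)  # stand-in for a real 1 s utterance
filtered = bandpass(speech, 16000, low_hz=1000, high_hz=2000)

# Compare transcripts of the original and band-limited signal.
print(transcribe(speech, 16000, processor, model))
print(transcribe(filtered, 16000, processor, model))
```

A psychoacoustics-style experiment would sweep the pass band and compare word or phoneme error rates against listener data.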