The Psychometrics of Automatic Speech Recognition

Weerts, Lotte; Clopath, Claudia; Goodman, Dan F. M.

doi:10.1101/2021.04.19.440438

Cited by 25 publications

(15 citation statements)

References 53 publications

(135 reference statements)

“…Our work and recent independent efforts (Weerts et al, 2021) suggest that, despite some predictive accuracy in neuroimaging studies (Millet & King, 2021;Kell, Yamins, Shook, Norman-Haignere, & McDermott, 2018) (but see, (Thompson, Bengio, & Schoenwiesner, 2019)), automatic speech recognition systems and humans diverge substantially in various perceptual domains. Our results further suggest that, far from being simply quantitative (e.g., receptive field sizes), these shortcomings are likely qualitative (e.g., lack of flexibility in task performance through exploiting alternative spectrotemporal scales) and would not be solved by such strategies as introducing different training regimens or increasing the models' capacity.…”

mentioning

confidence: 55%

“…These comprise the masking and silencing manipulations (Miller & Licklider, 1950), where the performance profiles vary more widely. Although there might be a way to reconcile these diverse performance patterns by altering minor parameters in the architectures, our work together with a parallel effort using different methods (Weerts, Rosen, Clopath, & Goodman, 2021) highlights a more fundamental difficulty of these architectures to perform well in the presence of noise. Furthermore, Weerts et al (Weerts et al, 2021) find that systems displayed similarities and differences in terms of what features they are tuned to (e.g., spectral vs. temporal modulations, and the use of temporal fine structure).…”

mentioning

confidence: 93%

“…Although there might be a way to reconcile these diverse performance patterns by altering minor parameters in the architectures, our work together with a parallel effort using different methods (Weerts, Rosen, Clopath, & Goodman, 2021) highlights a more fundamental difficulty of these architectures to perform well in the presence of noise. Furthermore, Weerts et al (Weerts et al, 2021) find that systems displayed similarities and differences in terms of what features they are tuned to (e.g., spectral vs. temporal modulations, and the use of temporal fine structure). As in our work, the self-supervised CNN-Transformer model exhibited a relatively greater similarity to humans, which follows a recent trend in vision (Tuli, Dasgupta, Grant, & Griffiths, 2021).…”

mentioning

confidence: 93%

Successes and critical failures of neural networks in capturing human-like speech recognition

Adolfi, Bowers, Poeppel

2022

Preprint

Natural and artificial audition can in principle evolve different solutions to a given problem. The constraints of the task, however, can nudge the cognitive science and engineering of audition to qualitatively converge, suggesting that a closer mutual examination would improve artificial hearing systems and process models of the mind and brain. Speech recognition, an area ripe for such exploration, is inherently robust in humans to a number of transformations at various spectrotemporal granularities. To what extent are these robustness profiles accounted for by high-performing neural network systems? We bring together experiments in speech recognition under a single synthesis framework to evaluate state-of-the-art neural networks as stimulus-computable, optimized observers. In a series of experiments, we (1) clarify how influential speech manipulations in the literature relate to each other and to natural speech, (2) show the granularities at which machines exhibit out-of-distribution robustness, reproducing classical perceptual phenomena in humans, (3) identify the specific conditions where model predictions of human performance differ, and (4) demonstrate a crucial failure of all artificial systems to perceptually recover where humans do, suggesting a key specification for theory and model building. These findings encourage a tighter synergy between the cognitive science and engineering of audition.

“…As a supervised reference system, we test a trained DeepSpeech model (Amodei et al, 2016). This model is not too intensive to train, is known to obtain reasonable ASR results, and has previously been compared to human speech perception (Millet and Dunbar, 2020b;Weerts et al, 2021). We train it to generate phonemic transcriptions.…”

Section: Supervised Reference: Deepspeech

mentioning

confidence: 99%

“…Feather et al (2019) used metamers as a tool to compare deep neural networks with humans. In a comparison between three speech recognition models, including a fine-tuned wav2vec 2.0 model, Weerts et al (2021) showed that wav2vec 2.0 was the best at matching human low-level psycho-acoustic behaviour. However, the model exhibited clear differences with respect to humans, showing, for example, heightened sensitivity to band-pass filtering and an under-reliance on temporal fine structure.…”

Section: Introduction

mentioning

confidence: 99%

Do self-supervised speech models develop human-like perception biases?

Millet, Dunbar

2022

Preprint

Self-supervised models for speech processing form representational spaces without using any external labels. Increasingly, they appear to be a feasible way of at least partially eliminating costly manual annotations, a problem of particular concern for low-resource languages. But what kind of representational spaces do these models construct? Human perception specializes to the sounds of listeners' native languages. Does the same thing happen in self-supervised models? We examine the representational spaces of three kinds of state-of-the-art self-supervised models: wav2vec 2.0, HuBERT and contrastive predictive coding (CPC), and compare them with the perceptual spaces of French-speaking and English-speaking human listeners, both globally and taking account of the behavioural differences between the two language groups. We show that the CPC model shows a small native language effect, but that wav2vec 2.0 and HuBERT seem to develop a universal speech perception space which is not language specific. A comparison against the predictions of supervised phone recognisers suggests that all three self-supervised models capture relatively fine-grained perceptual phenomena, while supervised models are better at capturing coarser, phone-level effects of listeners' native language on perception.

Convenience vs. Reliability? Evaluation of Human-Robot Interaction Preferences in a Production Environment

Schmidt, Meitinger

2024

Lecture Notes in Computer Science
