Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.541

Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech

Abstract: Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers vi…
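A minimal sketch of the idea summarized in the abstract: a small bottleneck module added to each encoder layer, so only a few extra parameters need to be stored per speaker. This assumes a PyTorch-style model; the class name, default dimensions, and ReLU non-linearity are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus a skip connection."""

    def __init__(self, d_model: int, bottleneck_dim: int = 256):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # The residual connection lets the adapter perturb, rather than replace,
        # the frozen encoder layer's output.
        return hidden + self.up(self.act(self.down(hidden)))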

Cited by 24 publications (13 citation statements) | References 21 publications
“…Houlsby et al [16] further advanced the design of residual adapters by developing a non-linear projection mechanism over latent features within frozen feature extractors (e.g., pre-trained transformer layers). Given the fact that acoustic feature encoders are standard components for ASR models, several recent works have demonstrated the effectiveness of applying residual adapters [16] for various speech applications, such as atypical speech [15], multilingual speech [22], and children's speech recognition [14]. Meanwhile, there are related works studying how to build trainable parameters upon latent features from speaker adaptation [23] and latent space adversarial reprogramming literature.…”
Section: Parameter-Efficient Learning With Frozen ASR Models
confidence: 99%
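A hedged sketch of the placement described in the statement above, where the non-linear projection (adapter) acts on latent features inside a frozen feature extractor. The wrapper class and its names are hypothetical; only the adapter's parameters would be updated during adaptation.

import torch.nn as nn

class AdaptedEncoderLayer(nn.Module):
    """Wraps a frozen pre-trained layer with a trainable adapter."""

    def __init__(self, pretrained_layer: nn.Module, adapter: nn.Module):
        super().__init__()
        self.layer = pretrained_layer
        for p in self.layer.parameters():
            p.requires_grad = False       # the pre-trained feature extractor stays frozen
        self.adapter = adapter            # only these parameters are trained

    def forward(self, x):
        # The adapter projects the frozen layer's latent features.
        return self.adapter(self.layer(x))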
“…For JUST training, we use a global learning rate scheduler as described in [33]. For residual adapter, we utilize the benchmark design from [16,15] with a latent dimension of 256 after ablations.…”
Section: Setup and Parameter-Efficient Architectures
confidence: 99%
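For a rough sense of scale, the 256-dimensional latent (bottleneck) size quoted above implies the following per-adapter parameter count; the encoder width of 512 is an assumed example value, not one reported in the cited papers.

def adapter_param_count(d_model: int, bottleneck_dim: int = 256) -> int:
    # down-projection and up-projection weight matrices plus their biases
    return d_model * bottleneck_dim + bottleneck_dim + bottleneck_dim * d_model + d_model

print(adapter_param_count(d_model=512))   # 262,912 extra parameters per adapter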
“…pretraining on control speech and finetuning on dysarthric speech [4,5,6]. A second approach is to decrease the model size [7], or to train an inserted small module instead of finetuning the whole model [8,9], so the number of parameters learned on the dysarthric data is limited. Thirdly and differently from the solutions that work on training strategy or model structure, [10,11,12,13] focus directly on the data and do augmentation to generate more dysarthric speech for use in training.…”
Section: Introduction
confidence: 99%
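A brief sketch of the second strategy in the statement above (training a small inserted module instead of fine-tuning the whole model): the base network is frozen and the optimizer only sees the inserted module's weights. The name filter "adapter" is an assumed convention, and "model" stands for any pre-trained ASR network.

import torch

def select_adapter_params(model: torch.nn.Module):
    """Freeze the pre-trained base model and collect only the adapter parameters."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name   # assumed naming convention for inserted modules
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Only the small inserted module is then learned on the dysarthric data, e.g.:
# optimizer = torch.optim.Adam(select_adapter_params(model), lr=1e-4)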
“…Recent efforts on dysarthric speech recognition focused on personalized or tuned ASR models (e.g., [4,5,6,7,8,9,10]), which leverage large proprietary or non-commercial datasets of atypical speech (e.g., Project Euphonia [1]; UASpeech [11]; AphasiaBank [12]). In this work, we take a more pragmatic approach that does not require vast quantities of data, yet enables people with severe speech differences to train phrase recognition models for applications where only a constrained set of phrases is needed.…”
Section: Introduction
confidence: 99%