Interspeech 2022
DOI: 10.21437/interspeech.2022-10582

Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition

Abstract: The limited memory bandwidth of smart devices motivates the development of smaller Automatic Speech Recognition (ASR) models. To obtain a smaller model, one can employ model compression techniques. Knowledge distillation (KD) is a popular model compression approach that has been shown to reduce model size with relatively little degradation in performance. In this approach, knowledge is distilled from a large, trained teacher model into a smaller student model. Also, the transducer-based mode…
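As a hedged illustration of the teacher/student knowledge distillation the abstract describes, the sketch below blends a hard-label cross-entropy term with a temperature-softened KL term taken from a teacher's outputs. The PyTorch framing, the temperature, the weighting factor, and the random tensors in the usage example are assumptions for the example only; the paper's actual transducer-level distillation objective is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a hard-label cross-entropy term and a KL term between
    temperature-softened teacher and student distributions."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                     # standard temperature scaling
    return alpha * hard + (1.0 - alpha) * soft

if __name__ == "__main__":
    # Random tensors stand in for model output distributions.
    student = torch.randn(8, 100)              # (batch, vocab)
    teacher = torch.randn(8, 100)
    labels = torch.randint(0, 100, (8,))
    print(distillation_loss(student, teacher, labels).item())
```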

Cited by 2 publications (5 citation statements)
References 30 publications
“…Prior research for Transformer based speech processing models has largely evolved into two categories: 1) architecture compression methods that aim to minimize the Transformer model structural redundancy measured by their depth, width, sparsity, or their combinations using techniques such as pruning [8][9][10], low-rank matrix factorization [11,12] and distillation [13,14]; and 2) low-bit quantization approaches that use either uniform [15][16][17][18], or mixed precision [12,19] settings. A combination of both architecture compression and low-bit quantization approaches has also been studied to produce larger model compression ratios [12].…”
Section: Introduction
confidence: 99%
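As a concrete but illustrative instance of the uniform low-bit quantization setting referenced in the statement above, the following sketch applies symmetric per-tensor quantization to a weight matrix and measures the rounding error. The bit-width, the per-tensor scaling rule, and the random weights are assumptions, not the configuration of any cited system.

```python
import torch

def uniform_quantize(weight: torch.Tensor, num_bits: int = 8):
    """Symmetric per-tensor uniform quantization followed by dequantization,
    so the induced rounding error can be inspected directly."""
    qmax = 2 ** (num_bits - 1) - 1             # e.g. 127 for 8-bit signed
    scale = weight.abs().max() / qmax          # per-tensor scale (illustrative choice)
    q = torch.clamp(torch.round(weight / scale), min=-qmax - 1, max=qmax)
    return q * scale, scale

if __name__ == "__main__":
    w = torch.randn(256, 256)                  # stand-in weight matrix
    w_hat, scale = uniform_quantize(w, num_bits=4)
    print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```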
“…The commonly adopted approach requires each target compressed system with the desired size to be individually constructed, for example, in [14,15,17] for Conformer models, and similarly for SSL foundation models such as DistilHuBERT [23], FitHuBERT [24], DPHuBERT [31], PARP [20], and LightHuBERT [30] (no more than 3 systems of varying complexity were built). 2) limited scope of system complexity attributes covering only a small subset of architecture hyper-parameters based on either network depth or width alone [8,9,11,35,36], or both [10,13,14,37], while leaving out the task of low-bit quantization, or vice versa [15][16][17][18][19][32][33][34]. This is particularly the case with the recent HuBERT model distillation research [23][24][25][28][29][30][31] that are focused on architectural compression alone.…”
Section: Introduction
confidence: 99%
“…Recently, iterative [5,21] and multi-stage [22] training methods have been proposed for acoustic model training. In [22], Rathod et al. showed that multi-stage training can be used in the T/S method to gradually compress the model without significant loss in performance. Similarly, iterative pseudo-labelling [21] and self-iteration methods [5] have shown improvements in semi-supervised and unsupervised data selection for ASR.…”
Section: Introduction
confidence: 99%
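To make the multi-stage teacher/student idea in the statement above concrete, the schematic sketch below distils each stage's model into a progressively smaller student, which then serves as the teacher for the next stage. The toy regression setup, layer widths, stage count, and MSE objective are all illustrative assumptions, not the recipe from [22].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_student(teacher, width, steps=200):
    # Hypothetical student: same input/output dims, narrower hidden layer.
    student = nn.Sequential(nn.Linear(64, width), nn.ReLU(), nn.Linear(width, 10))
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(32, 64)                # stand-in input features
        with torch.no_grad():
            target = teacher(x)                # soft targets from the current teacher
        loss = F.mse_loss(student(x), target)  # regress onto the teacher's outputs
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

if __name__ == "__main__":
    # Start from a wide "teacher" and compress over three successive stages.
    model = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 10))
    for width in (256, 128, 64):               # progressively smaller students
        model = train_student(model, width)
    print(sum(p.numel() for p in model.parameters()), "parameters in final model")
```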