Interspeech 2022
DOI: 10.21437/interspeech.2022-10809

4-bit Conformer with Native Quantization Aware Training for Speech Recognition

Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N :M structured spa…

Cited by 14 publications (7 citation statements)
References 28 publications
“…When large DNNs such as RNNT are implemented with reduced digital precision, optimal precision choices may vary across the network [28][29][30]. Similarly, implementation in analog-AI HW also requires careful layer-specific choices to balance accuracy and performance.…”
Section: Article (mentioning)
confidence: 99%
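To make the idea of layer-specific precision concrete, here is a minimal NumPy sketch (an illustration, not code from the cited works) that fake-quantizes each layer's weights with its own bit width; the layer names and bit assignments below are hypothetical.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization: round weights onto a signed integer
    grid of the given bit width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# Hypothetical per-layer bit widths; real assignments must be tuned per model.
layer_bits = {"encoder.conv": 8, "encoder.attention": 4, "joint_network": 8}
weights = {name: np.random.randn(64, 64).astype(np.float32) for name in layer_bits}

for name, bits in layer_bits.items():
    w = weights[name]
    err = float(np.mean(np.abs(w - fake_quantize(w, bits))))
    print(f"{name}: {bits}-bit, mean absolute rounding error = {err:.4f}")
```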
“…Model compression has commonly been achieved through a number of methods such as sparsity pruning [6,10,11], low-bit quantization [12,13,14], knowledge distillation [15,16], and low-rank matrix factorization [17,18]. These techniques can typically be applied regardless of the model architecture, which allows them to be generalized to different tasks.…”
Section: Related Work (mentioning)
confidence: 99%
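Of the methods listed in this excerpt, low-rank matrix factorization is the easiest to sketch: a weight matrix is approximated by the product of two thinner matrices obtained from a truncated SVD, cutting parameters from m·n to r·(m+n). The snippet below is a generic NumPy illustration under that assumption, not the specific recipe of any cited paper.

```python
import numpy as np

def low_rank_factorize(w: np.ndarray, rank: int):
    """Approximate W (m x n) as A @ B with A (m x r) and B (r x n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # absorb singular values into A
    b = vt[:rank, :]
    return a, b

w = np.random.randn(512, 512).astype(np.float32)
a, b = low_rank_factorize(w, rank=64)
print("parameters:", w.size, "->", a.size + b.size)
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```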
“…However, without structured sparsity [19], the resulting model requires irregular memory access, and without hardware support, memory usage and computation become inefficient. Quantization is typically applied to reduce model weights from 32-bit floating-point values down to 8-bit integer values, and is also applied at lower quantization levels (i.e., 1-bit, 2-bit, or 4-bit [5,14]) and even with mixed-precision quantization [20]. However, computations on low-bit quantized models are not available on typical real-world hardware.…”
Section: Related Work (mentioning)
confidence: 99%
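The structured sparsity mentioned above is commonly realized as an N:M pattern (also referenced in the truncated abstract), which keeps at most N non-zero weights in every group of M consecutive weights so that hardware can skip the zeros with regular memory access. A minimal 2:4 pruning sketch in NumPy, again purely illustrative:

```python
import numpy as np

def prune_n_m(w: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights (row-major order) and zero out the rest: N:M structured sparsity."""
    groups = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(8, 16).astype(np.float32)
sparse_w = prune_n_m(w)                       # 2:4 pattern -> 50% zeros
print("sparsity:", float(np.mean(sparse_w == 0)))
```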
“…Prior research on Transformer-based speech processing models has largely evolved into two categories: 1) architecture compression methods that aim to minimize the Transformer model's structural redundancy, measured by depth, width, sparsity, or their combinations, using techniques such as pruning [8][9][10], low-rank matrix factorization [11,12] and distillation [13,14]; and 2) low-bit quantization approaches that use either uniform [15][16][17][18] or mixed-precision [12,19] settings. A combination of both architecture compression and low-bit quantization has also been studied to produce larger model compression ratios [12].…”
Section: Introduction (mentioning)
confidence: 99%
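Among the architecture-compression techniques cited in this excerpt, knowledge distillation trains the compressed model to match a teacher's temperature-softened output distribution. The toy NumPy sketch below shows only the standard distillation loss term and is an assumed, generic illustration rather than the setup of any cited paper.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 as is conventional for distillation."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return temperature ** 2 * float(np.mean(kl))

teacher = np.random.randn(32, 100)            # e.g. 32 frames, 100 output tokens
student = teacher + 0.5 * np.random.randn(32, 100)
print("distillation loss:", distillation_loss(student, teacher))
```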
“…The above existing research suffers from the following limitations: 1) Weak scalability when used to produce compressed systems of varying target complexity tailored for diverse user devices. The commonly adopted approach requires each target compressed system of the desired size to be individually constructed, for example, in [14,15,17] for Conformer models, and similarly for SSL foundation models such as DistilHuBERT [23], FitHuBERT [24], DPHuBERT [31], PARP [20], and LightHuBERT [30] (no more than 3 systems of varying complexity were built). 2) Limited scope of system complexity attributes, covering only a small subset of architecture hyper-parameters based on either network depth or width alone [8,9,11,35,36], or both [10,13,14,37], while leaving out low-bit quantization, or vice versa [15][16][17][18][19][32][33][34].…”
Section: Introduction (mentioning)
confidence: 99%