Interspeech 2022
DOI: 10.21437/interspeech.2022-10809

4-bit Conformer with Native Quantization Aware Training for Speech Recognition

Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N :M structured spa…

Cited by 14 publications (7 citation statements)
References 28 publications
“…When large DNNs such as RNNT are implemented with reduced digital precision, optimal precision choices may vary across the network [28][29][30]. Similarly, implementation in analog-AI HW also requires careful layer-specific choices to balance accuracy and performance.…”
Section: Article (mentioning)
confidence: 99%
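To make the idea of layer-specific precision concrete, here is a minimal NumPy sketch (an illustration, not code from the cited works) that fake-quantizes each layer's weights with its own bit width; the layer names and bit assignments below are hypothetical.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization: round weights onto a signed integer
    grid of the given bit width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# Hypothetical per-layer bit widths; real assignments must be tuned per model.
layer_bits = {"encoder.conv": 8, "encoder.attention": 4, "joint_network": 8}
weights = {name: np.random.randn(64, 64).astype(np.float32) for name in layer_bits}

for name, bits in layer_bits.items():
    w = weights[name]
    err = float(np.mean(np.abs(w - fake_quantize(w, bits))))
    print(f"{name}: {bits}-bit, mean absolute rounding error = {err:.4f}")
```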
“…Model compression has commonly been achieved through a number of methods such as sparsity pruning [6,10,11], low-bit quantization [12,13,14], knowledge distillation [15,16], and low-rank matrix factorization [17,18]. These techniques can typically be applied regardless of the model architecture, which allows them to be generalized to different tasks.…”
Section: Related Work (mentioning)
confidence: 99%
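Of the methods listed in this excerpt, low-rank matrix factorization is the easiest to sketch: a weight matrix is approximated by the product of two thinner matrices obtained from a truncated SVD, cutting parameters from m·n to r·(m+n). The snippet below is a generic NumPy illustration under that assumption, not the specific recipe of any cited paper.

```python
import numpy as np

def low_rank_factorize(w: np.ndarray, rank: int):
    """Approximate W (m x n) as A @ B with A (m x r) and B (r x n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # absorb singular values into A
    b = vt[:rank, :]
    return a, b

w = np.random.randn(512, 512).astype(np.float32)
a, b = low_rank_factorize(w, rank=64)
print("parameters:", w.size, "->", a.size + b.size)
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```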
“…However, without structured sparsity [19], the resulting model requires irregular memory access, and without hardware support, memory usage and computation become inefficient. Quantization is typically applied to reduce model weights from 32-bit floating-point values down to 8-bit integer values, and is also applied at lower quantization levels (i.e., 1-bit, 2-bit, or 4-bit [5,14]) and even with mixed-precision quantization [20]. However, computations on low-bit quantized models are not available on typical real-world hardware.…”
Section: Related Work (mentioning)
confidence: 99%
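The structured sparsity mentioned above is commonly realized as an N:M pattern (also referenced in the truncated abstract), which keeps at most N non-zero weights in every group of M consecutive weights so that hardware can skip the zeros with regular memory access. A minimal 2:4 pruning sketch in NumPy, again purely illustrative:

```python
import numpy as np

def prune_n_m(w: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights (row-major order) and zero out the rest: N:M structured sparsity."""
    groups = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(8, 16).astype(np.float32)
sparse_w = prune_n_m(w)                       # 2:4 pattern -> 50% zeros
print("sparsity:", float(np.mean(sparse_w == 0)))
```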
“…Prior research on Transformer-based speech processing models has largely evolved into two categories: 1) architecture compression methods that aim to minimize the Transformer model's structural redundancy, measured by depth, width, sparsity, or their combinations, using techniques such as pruning [8][9][10], low-rank matrix factorization [11,12] and distillation [13,14]; and 2) low-bit quantization approaches that use either uniform [15][16][17][18] or mixed-precision [12,19] settings. A combination of both architecture compression and low-bit quantization has also been studied to produce larger model compression ratios [12].…”
Section: Introduction (mentioning)
confidence: 99%
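Among the architecture-compression techniques cited in this excerpt, knowledge distillation trains the compressed model to match a teacher's temperature-softened output distribution. The toy NumPy sketch below shows only the standard distillation loss term and is an assumed, generic illustration rather than the setup of any cited paper.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 as is conventional for distillation."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return temperature ** 2 * float(np.mean(kl))

teacher = np.random.randn(32, 100)            # e.g. 32 frames, 100 output tokens
student = teacher + 0.5 * np.random.randn(32, 100)
print("distillation loss:", distillation_loss(student, teacher))
```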
“…The above existing research suffers from the following limitations: 1) Weak scalability when used to produce compressed systems of varying target complexity tailored for diverse user devices. The commonly adopted approach requires each target compressed system of the desired size to be individually constructed, for example, in [14,15,17] for Conformer models, and similarly for SSL foundation models such as DistilHuBERT [23], FitHuBERT [24], DPHuBERT [31], PARP [20], and LightHuBERT [30] (no more than 3 systems of varying complexity were built). 2) Limited scope of system complexity attributes, covering only a small subset of architecture hyper-parameters based on either network depth or width alone [8,9,11,35,36], or both [10,13,14,37], while leaving out low-bit quantization, or vice versa [15][16][17][18][19][32][33][34].…”
Section: Introduction (mentioning)
confidence: 99%