Bifocal Neural ASR: Exploiting Keyword Spotting for Inference Optimization

Macoskey, Jon; Strimel, Grant P.; Rastrow, Ariya

doi:10.1109/icassp39728.2021.9414652

Cited by 12 publications

(6 citation statements)

References 17 publications

“…which includes the standard neural transducer loss [21] and an added compute cost penalty term Lcompute. Accounted for by the cumulative number of FLOPs across the components of the network for a streaming sequence, Lcompute drives more computation cost reduction while maintaining predictive performance of the model.…”

Section: End-to-end Optimization (mentioning)

confidence: 99%
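The combined objective described in the statement above can be sketched as follows. This is a minimal illustration, not the citing authors' implementation: the weighting factor `lam`, the per-frame gate values, and the per-component FLOP counts are hypothetical names and numbers introduced only for this example, and the transducer loss is passed in as a plain number.

```python
from typing import Sequence


def compute_cost(gates: Sequence[Sequence[float]], flops: Sequence[float]) -> float:
    """Cumulative FLOPs over a sequence.

    gates[t][k] is the (assumed) activation of component k on frame t,
    in [0, 1]; flops[k] is the (assumed) FLOP count of component k.
    """
    return sum(g * f for frame in gates for g, f in zip(frame, flops))


def total_loss(transducer_loss: float,
               gates: Sequence[Sequence[float]],
               flops: Sequence[float],
               lam: float) -> float:
    """Transducer loss plus a weighted compute penalty.

    `lam` is a hypothetical trade-off weight: larger values push the
    model toward cheaper computation pathways.
    """
    return transducer_loss + lam * compute_cost(gates, flops)


if __name__ == "__main__":
    gates = [[1.0, 0.0], [1.0, 1.0]]   # two frames, two components
    flops = [10.0, 100.0]              # illustrative per-component cost
    print(total_loss(2.0, gates, flops, lam=0.01))
```

Raising `lam` in this sketch makes skipped components more attractive, which is one plausible way a compute term could trade accuracy for efficiency.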

“…To address these motivations, in this work, we present a Transformer architecture which is able to amortize compute cost on demand at inference by dynamically activating compute components conditioned on input frames. Our work is inspired by prior investigations of dynamic compute for speech processing which have yielded notable results [21,22,23,24]. Considering speech examples can be lengthy streaming sequences with many frames of different levels of inherent complexity (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

“…consider silent frames between acoustically rich segments), dynamic computing enables a desired balance between accuracy and efficiency by altering the compute expenditure each frame. Unlike existing methods [21,23,24] which bifurcate ASR into only two fixed branched encoder networks, with our approach, an exponential family of dynamic branches are generated from a single Transformer encoder which significantly expands the modeling space and boosts the adaptability and expressivity of the model. Lightweight prediction arbitrators (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

Compute Cost Amortized Transformer for Streaming ASR

Macoskey, Radfar, Chang et al. 2022

Preprint

Self Cite

We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization. Our architecture creates sparse computation pathways dynamically at inference time, resulting in selective use of compute resources throughout decoding and enabling significant reductions in compute with minimal impact on accuracy. The fully differentiable architecture is trained end-to-end with an accompanying lightweight arbitrator mechanism operating at the frame level to make dynamic decisions on each input, while a tunable loss function is used to regularize the overall level of compute against predictive performance. We report empirical results from experiments using the compute amortized Transformer-Transducer (T-T) model conducted on LibriSpeech data. Our best model can achieve a 60% compute cost reduction with only a 3% relative word error rate (WER) increase.
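A frame-level arbitrator of the kind the abstract describes might look like the sketch below at inference time. Everything here is an assumption made for illustration: the linear-plus-sigmoid scoring, the 0.5 threshold, the identity pass-through for skipped frames, and the names `arbitrator` and `run_layer`. The hard threshold is not differentiable, so the relaxation needed for end-to-end training is not shown.

```python
import math
from typing import Callable, List, Sequence, Tuple


def arbitrator(frame: Sequence[float], weights: Sequence[float], bias: float) -> bool:
    """Hypothetical lightweight gate: run the block if sigmoid(w.x + b) > 0.5."""
    score = sum(w * x for w, x in zip(weights, frame)) + bias
    return 1.0 / (1.0 + math.exp(-score)) > 0.5


def run_layer(frames: Sequence[Sequence[float]],
              block: Callable[[Sequence[float]], List[float]],
              weights: Sequence[float],
              bias: float) -> Tuple[List[List[float]], int]:
    """Apply `block` only to frames the arbitrator selects.

    Skipped frames are passed through unchanged. Returns the outputs
    and the number of frames on which the block actually ran.
    """
    outputs: List[List[float]] = []
    executed = 0
    for frame in frames:
        if arbitrator(frame, weights, bias):
            outputs.append(block(frame))
            executed += 1
        else:
            outputs.append(list(frame))
    return outputs, executed


if __name__ == "__main__":
    frames = [[0.0, 0.0], [1.0, 2.0]]          # a quiet frame and a richer one
    double = lambda f: [2.0 * x for x in f]    # stand-in for an expensive block
    outputs, executed = run_layer(frames, double, weights=[1.0, 1.0], bias=-1.0)
    print(outputs, f"{executed}/{len(frames)} frames computed")
```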

“…We build a conventional RNN-T consisting of 5 LSTM encoding layers and 2 LSTM decoding layers with 1024 hidden units per layer and a fully connected joint layer. Furthermore, we benchmark GQ on an RNN-T variant with a branched encoder, named Bifocal RNN-T [18]. It has 2 encoders of different computational complexity and decides on-the-fly which encoder to use per input frame.…”

Section: Model (mentioning)

confidence: 99%
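The per-frame choice between two encoders mentioned in the statement above can be sketched as a simple router. This is an illustration under stated assumptions, not the Bifocal RNN-T implementation: `use_large` is a placeholder for whatever decision rule the model uses, and both encoders are stand-in functions.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]
Encoder = Callable[[Frame], List[float]]


def bifocal_encode(frames: Sequence[Frame],
                   small_encoder: Encoder,
                   large_encoder: Encoder,
                   use_large: Callable[[Frame], bool]) -> Tuple[List[List[float]], int]:
    """Route each frame to one of two encoders.

    `use_large` is a hypothetical per-frame selector. Returns the encoded
    frames and how many of them went through the large encoder.
    """
    outputs: List[List[float]] = []
    large_calls = 0
    for frame in frames:
        if use_large(frame):
            outputs.append(large_encoder(frame))
            large_calls += 1
        else:
            outputs.append(small_encoder(frame))
    return outputs, large_calls


if __name__ == "__main__":
    frames = [[1.0], [4.0], [2.0]]
    small = lambda f: list(f)                  # stand-in for the cheap encoder
    large = lambda f: [2.0 * x for x in f]     # stand-in for the costly encoder
    outputs, large_calls = bifocal_encode(frames, small, large, lambda f: f[0] > 3.0)
    print(outputs, large_calls)
```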

“…Under GQ, quantization centroids are self-adjustable but in a µ-Law constrained space. As a proof-of-concept, we adopt the ASR task and conduct experiments on both the LibriSpeech and de-identified far-field datasets to evaluate GQ on three major end-to-end ASR architectures, namely conventional Recurrent Neural Network Transducer (RNN-T) [17], Bifocal RNN-T [18], and Conformer [19] [20]. Our results show that in all three architectures, GQ yields little to no accuracy loss when compressing models to S8B or even sub-5-bit (5-bit or lower).…”

Section: Introduction (mentioning)

confidence: 99%

Sub-8-bit quantization for on-device speech recognition: a regularization-free approach

Zhang, Radfar, Nguyen et al. 2022

Preprint

For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous to achieve the trade-off between model predictive performance and efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a µ-Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference. We observe a 30.73% memory footprint saving and 31.75% user-perceived latency reduction compared to 8-bit QAT via physical device benchmarking.
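For readers unfamiliar with µ-law companding, the sketch below shows the classic form: values are compressed with the µ-law curve, rounded onto a uniform grid, and expanded back, which places quantization levels more densely near zero. This is the textbook technique with fixed levels, not the General Quantizer itself, whose centroids are described as self-adjustable. The function names, the default `mu = 255`, and the assumption that inputs lie in [-1, 1] are choices made for this example.

```python
import math


def mu_law_compress(x: float, mu: float = 255.0) -> float:
    """Map x in [-1, 1] to [-1, 1] with the mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: float = 255.0) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(x: float, bits: int, mu: float = 255.0) -> float:
    """Quantize x in [-1, 1] to 2**bits levels, uniform in the mu-law domain."""
    steps = 2 ** bits - 1                       # number of intervals on the grid
    y = mu_law_compress(x, mu)
    y_q = round((y + 1.0) / 2.0 * steps) / steps * 2.0 - 1.0
    return mu_law_expand(y_q, mu)


if __name__ == "__main__":
    for value in (-0.5, 0.01, 0.5):
        print(value, quantize(value, bits=4))
```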
