ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414652

Bifocal Neural ASR: Exploiting Keyword Spotting for Inference Optimization

Abstract: We present Bifocal RNN-T, a new variant of the Recurrent Neural Network Transducer (RNN-T) architecture designed for improved inference time latency on speech recognition tasks. The architecture enables a dynamic pivot for its runtime compute pathway, namely taking advantage of keyword spotting to select which component of the network to execute for a given audio frame. To accomplish this, we leverage a recurrent cell we call the Bifocal LSTM (BF-LSTM), which we detail in the paper. The architecture is compati…
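The abstract's core mechanism, a per-frame pivot in which a keyword-spotting signal selects either a cheap or an expensive recurrent pathway, can be illustrated with a minimal sketch. This is not the paper's exact BF-LSTM cell: the sigmoid gate standing in for the keyword spotter, the branch sizes, and the output projections are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class BifocalFrameRouter(nn.Module):
    """Sketch of a bifocal compute pathway: a cheap gate decides, per audio
    frame, whether the small or the large LSTM branch produces the output."""

    def __init__(self, input_dim=80, small_dim=256, large_dim=1024, out_dim=1024):
        super().__init__()
        self.gate = nn.Linear(input_dim, 1)             # stand-in keyword spotter
        self.small = nn.LSTMCell(input_dim, small_dim)  # cheap pathway
        self.large = nn.LSTMCell(input_dim, large_dim)  # expensive pathway
        self.proj_small = nn.Linear(small_dim, out_dim)
        self.proj_large = nn.Linear(large_dim, out_dim)

    def forward(self, frames):  # frames: (T, B, input_dim), streamed frame by frame
        B = frames.size(1)
        hs, cs = (torch.zeros(B, self.small.hidden_size) for _ in range(2))
        hl, cl = (torch.zeros(B, self.large.hidden_size) for _ in range(2))
        outputs = []
        for x in frames:
            use_large = torch.sigmoid(self.gate(x)) > 0.5  # (B, 1) pivot decision
            # For clarity this sketch evaluates both branches every frame;
            # a real implementation executes only the selected branch, which
            # is where the inference-time savings come from.
            hs, cs = self.small(x, (hs, cs))
            hl, cl = self.large(x, (hl, cl))
            y = torch.where(use_large, self.proj_large(hl), self.proj_small(hs))
            outputs.append(y)
        return torch.stack(outputs)  # (T, B, out_dim)
```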

Cited by 12 publications (6 citation statements)
References 17 publications
“…which includes the standard neural transducer loss [21] and an added compute cost penalty term Lcompute. Computed as the cumulative number of FLOPs across the components of the network over a streaming sequence, Lcompute drives further computation cost reduction while maintaining the predictive performance of the model.…”
Section: End-to-end Optimization (mentioning, confidence 99%)
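A hedged reconstruction of the objective this statement describes, assuming a scalar trade-off weight λ and per-component execution gates (neither is given explicitly in the excerpt):

$$
\mathcal{L} = \mathcal{L}_{\text{RNN-T}} + \lambda\,\mathcal{L}_{\text{compute}},
\qquad
\mathcal{L}_{\text{compute}} = \sum_{t=1}^{T} \sum_{m \in \mathcal{M}} g_m(t)\,\mathrm{FLOPs}(m),
$$

where $\mathcal{M}$ is the set of network components, $g_m(t) \in [0,1]$ indicates whether component $m$ runs on frame $t$ of the streaming sequence, and $\lambda$ balances compute reduction against predictive performance.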
“…To address these motivations, in this work, we present a Transformer architecture which is able to amortize compute cost on demand at inference by dynamically activating compute components conditioned on input frames. Our work is inspired by prior investigations of dynamic compute for speech processing which have yielded notable results [21,22,23,24]. Considering speech examples can be lengthy streaming sequences with many frames of different levels of inherent complexity (e.g.…”
Section: Introduction (mentioning, confidence 99%)
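The dynamic-activation idea in this statement can be sketched as a per-frame gate that decides whether a frame passes through an expensive Transformer block or bypasses it unchanged. The gate, the 0.5 threshold, and the layer sizes below are assumptions, not the cited paper's mechanism.

```python
import torch
import torch.nn as nn

class GatedTransformerBlock(nn.Module):
    """Sketch of input-conditioned compute: each frame either runs through
    the Transformer block or is carried through on a residual bypass."""

    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x):  # x: (B, T, d_model)
        keep = torch.sigmoid(self.gate(x)) > 0.5  # (B, T, 1): run the block?
        # The sketch runs the block densely and masks the result; a real
        # system would skip computation for gated-off frames so that cost
        # is amortized on demand at inference.
        return torch.where(keep, self.block(x), x)
```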
“…We build a conventional RNN-T consisting of 5 LSTM encoding layers and 2 LSTM decoding layers with 1024 hidden units per layer and a fully connected joint layer. Furthermore, we benchmark GQ on an RNN-T variant with a branched encoder, named Bifocal RNN-T [18]. It has 2 encoders of different computational complexity and decides on-the-fly which encoder to use per input frame.…”
Section: Model (mentioning, confidence 99%)
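The baseline described here (5 LSTM encoder layers, 2 LSTM decoder layers, 1024 hidden units, fully connected joint layer) can be sketched in PyTorch as follows. The feature dimension, vocabulary size, and the concatenation-based joint are illustrative assumptions; the Bifocal RNN-T variant would replace the single encoder with two encoders of different cost plus a per-frame selector, as in the sketch after the abstract.

```python
import torch
import torch.nn as nn

class ConventionalRNNT(nn.Module):
    """Sketch of the benchmark RNN-T: 5-layer LSTM encoder, 2-layer LSTM
    decoder (1024 units each), joined by a fully connected layer."""

    def __init__(self, feat_dim=80, vocab_size=2500, hidden=1024):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=5, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.joint = nn.Linear(2 * hidden, vocab_size)

    def forward(self, feats, labels):  # feats: (B, T, feat_dim); labels: (B, U)
        enc, _ = self.encoder(feats)                # (B, T, H)
        dec, _ = self.decoder(self.embed(labels))   # (B, U, H)
        # Broadcast encoder/decoder states over the (T, U) lattice and join.
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, dec.size(1), -1),
             dec.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        return self.joint(joint_in)                 # (B, T, U, vocab) logits
```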
“…Under GQ, quantization centroids are self-adjustable but in a µ-Law constrained space. As a proof of concept, we adopt the ASR task and conduct experiments on both the LibriSpeech and de-identified far-field datasets to evaluate GQ on three major end-to-end ASR architectures, namely the conventional Recurrent Neural Network Transducer (RNN-T) [17], Bifocal RNN-T [18], and Conformer [19,20]. Our results show that in all three architectures, GQ yields little to no accuracy loss when compressing models to sub-8-bit or even sub-5-bit (5-bit or lower).…”
Section: Introduction (mentioning, confidence 99%)
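The µ-law-constrained space mentioned here can be illustrated with a minimal quantization sketch: compand weights into µ-law space, quantize uniformly there, and expand back. GQ additionally makes the centroids self-adjustable within that space; the fixed uniform grid below is a simplification, and the µ value and bit width are placeholder assumptions.

```python
import numpy as np

def mu_law_quantize(w, bits=5, mu=255.0):
    """Quantize a weight array on a fixed mu-law grid (simplified stand-in
    for GQ's learnable centroids in mu-law space).

    Returns the dequantized weights and the integer codes."""
    scale = np.max(np.abs(w)) + 1e-12                 # normalize to [-1, 1]
    x = w / scale
    # mu-law companding: F(x) = sign(x) * ln(1 + mu|x|) / ln(1 + mu)
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    levels = 2 ** bits - 1
    q = np.round((companded + 1.0) / 2.0 * levels)    # integer codes 0..levels
    companded_hat = q / levels * 2.0 - 1.0
    # Inverse companding: F^-1(y) = sign(y) * ((1 + mu)^|y| - 1) / mu
    x_hat = np.sign(companded_hat) * ((1 + mu) ** np.abs(companded_hat) - 1) / mu
    return x_hat * scale, q.astype(np.int32)
```

The µ-law grid places more centroids near zero, where neural-network weights concentrate, which is why companded quantization can reach low bit widths with little accuracy loss.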