Interspeech 2020
DOI: 10.21437/interspeech.2020-2840
Scaling Up Online Speech Recognition Using ConvNets

Abstract: We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate. Also important to the efficiency of the recognizer is our highly optimized beam search…
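The latency-limiting idea in the abstract, restricting how far a convolution can look into future frames, can be sketched with asymmetric padding. This is an illustrative toy in NumPy, not the paper's actual TDS block; the function name and formulation are my own.

```python
import numpy as np

def limited_context_conv1d(x, kernel, future_frames=1):
    """1-D convolution whose output at time t depends on at most
    `future_frames` frames of future input, via asymmetric zero padding.

    Sketch of the latency-limiting idea only; not the TDS block itself.
    """
    k = len(kernel)
    past = k - 1 - future_frames          # left (past-side) padding
    assert past >= 0, "future_frames must be smaller than the kernel size"
    padded = np.pad(x, (past, future_frames))
    # Output frame t covers inputs x[t - past : t + future_frames + 1]
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

x = np.arange(8, dtype=float)
y = limited_context_conv1d(x, np.ones(3) / 3, future_frames=1)
```

With `future_frames=1`, output frame t never reads input frames beyond t+1, which is what bounds the algorithmic lookahead latency.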

Cited by 33 publications (23 citation statements)
References 22 publications
“…Then, the training objective is defined by combining Eqs. (6) and (9): $\mathcal{L} := (1 - w)\,\mathcal{L}_{\text{CTC}} + w\,\mathcal{L}_{\text{InterCTC}}$ (10) with a hyper-parameter $w$.…”
Section: Intermediate CTC
confidence: 99%
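The combined objective in this citation is a simple weighted sum, which can be written as a one-line helper. Only the weighting L = (1 - w) * L_CTC + w * L_InterCTC comes from the quoted equation; averaging over intermediate layers and the default w are illustrative assumptions.

```python
def intermediate_ctc_objective(l_ctc, inter_losses, w=0.3):
    """Weighted training objective from the quoted equation:
    L = (1 - w) * L_CTC + w * L_InterCTC.

    Taking L_InterCTC as the mean over intermediate-layer CTC losses,
    and the default w, are assumptions for illustration only.
    """
    l_inter = sum(inter_losses) / len(inter_losses)
    return (1.0 - w) * l_ctc + w * l_inter
```

Setting w = 0 recovers the plain final-layer CTC loss, which is why w is treated as a tunable hyper-parameter.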
“…Connectionist temporal classification (CTC) [7] has been a widely used method for end-to-end ASR modeling [8,9,10,11,12], and it is an especially attractive method for on-device ASR. For low-end devices, CTC is suitable for lightweight modeling.…”
Section: Introduction
confidence: 99%
“…Decoder lag can be reduced by changing model architectures to reduce computation, as well as modifying the loss function to encourage the model to output tokens more promptly [23]. Unlike in previous works [18,24] inter alia, we examine emission delay of the first token, and the time required to finalize the ASR result without considering per-token emission delays.…”
Section: Measuring Latency Metrics
confidence: 99%
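The two latency quantities this citing work examines, first-token emission delay and the time to finalize the result after the audio ends, can be computed directly from emission timestamps. The function and argument names below are hypothetical, chosen only to illustrate the two definitions.

```python
def first_token_delay(token_emit_times_s):
    """Emission delay of the first token, measured from utterance start (s)."""
    return token_emit_times_s[0]

def finalization_lag(audio_end_s, final_result_time_s):
    """Time needed to finalize the ASR result after the audio stream ends (s)."""
    return final_result_time_s - audio_end_s
```

Note that neither quantity looks at per-token delays beyond the first token, matching the distinction the citation draws from prior work.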
“…Although recent advances in architectural design [7,8] and pre-training methods [9] have improved performance with CTC, it is usually weaker than encoder-decoder models, often credited to its strong conditional independence assumption, and closing this performance gap often requires external language models (LMs) and beam search algorithms [10,11], which demand extra computational cost and effectively make the model autoregressive. Therefore, it is important to improve CTC modeling to reduce overall computational overhead, ideally without the help of an LM or beam search.…”
Section: Introduction
confidence: 99%