ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413905

Efficient Knowledge Distillation for RNN-Transducer Models

Abstract: Knowledge Distillation is an effective method of transferring knowledge from a large model to a smaller model. Distillation can be viewed as a type of model compression, and has played an important role for on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. Our proposed distillation loss is simple and efficient, and uses only the "y" and "blank" posterior probabilities …
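
The loss itself is not shown in the truncated abstract, so the following is only a plausible reconstruction of the described idea (distilling over the "y" and "blank" posteriors, with everything else merged); the symbols P_T, P_S, and the remainder class r are our notation, not the paper's:

L_{KD} = \sum_{(t,u)} \sum_{l \in \{ y_{u+1},\, \varnothing,\, r \}} P_T(l \mid t,u) \, \log \frac{P_T(l \mid t,u)}{P_S(l \mid t,u)},
\qquad P(r \mid t,u) = 1 - P(y_{u+1} \mid t,u) - P(\varnothing \mid t,u)

Here the sum runs over the nodes (t, u) of the RNN-T output lattice, P_T and P_S are teacher and student posteriors, y_{u+1} is the correct next label, ∅ is blank, and r merges all remaining vocabulary entries into a single class.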

Cited by 22 publications (16 citation statements)
References 26 publications

“…where y is the ground-truth sequence of tokens, whose length is U; P_0 and P_∞ ∈ ℝ^{U×V} are the streaming-mode and full-context-mode outputs, respectively; L_trans(·, ·) is the transducer loss and L_distil(·, ·) is the in-place knowledge distillation loss. Instead of taking the direct KL divergence between P_0 and P_∞, Dual-mode ASR follows [17] and merges the probabilities of unimportant tokens for efficient knowledge distillation…”
Section: Dual-mode ASR (mentioning)
Confidence: 99%
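
The equation this excerpt quotes is not reproduced on this page, but the distillation term it describes, a KL divergence between the streaming and full-context outputs taken over merged token classes following [17], can be sketched as below. This is a hedged reconstruction: the function name merged_kl_distill_loss, the exact tensor shapes, and the KL direction are assumptions, not the authors' code.

    # Minimal sketch (not the authors' implementation) of the merged
    # three-class distillation term: the full-vocabulary KL between the
    # full-context ("teacher") and streaming ("student") outputs is replaced
    # by a KL over {correct label, blank, remainder}.
    import torch

    def merged_kl_distill_loss(p_full, p_stream, labels, blank_id, eps=1e-8):
        """
        p_full, p_stream: [U, V] posteriors of the full-context and streaming
                          modes, softmax-normalized over the vocabulary V
                          (shapes follow the excerpt above).
        labels:           [U] ground-truth token ids (LongTensor); labels[u]
                          is the token treated as "important" at position u.
        blank_id:         index of the blank symbol in the vocabulary.
        """
        idx = labels.unsqueeze(-1)                              # [U, 1]

        def collapse(p):
            p_y = p.gather(-1, idx).squeeze(-1)                 # P(correct label)
            p_blank = p[:, blank_id]                            # P(blank)
            p_rest = (1.0 - p_y - p_blank).clamp(min=0.0)       # merged "unimportant" mass
            return torch.stack([p_y, p_blank, p_rest], dim=-1)  # [U, 3]

        q_t = collapse(p_full).clamp(min=eps)                   # teacher: full-context mode
        q_s = collapse(p_stream).clamp(min=eps)                 # student: streaming mode

        # KL(teacher || student) over the three merged classes, summed over positions.
        return (q_t * (q_t.log() - q_s.log())).sum()

Merging everything except the correct label and blank into a single remainder class is what keeps the distillation cheap: for example, with U = 50 positions and a 4,096-token vocabulary, only 50 × 3 probabilities per model need to be compared instead of 50 × 4,096.
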
“…In practice, although varying depending on the application, latency requirements usually sit at around 300 ms (median) and under 1 s (95th percentile). To address this challenge, several methods have been studied, especially those based on joint training [16] and knowledge distillation [17]. Recently, a framework called Dual-mode ASR has been introduced, in which a single model is trained with two different modes: streaming and full-context [18].…”
Section: Introduction (mentioning)
Confidence: 99%
“…Knowledge distillation (KD) techniques have been used in the context of speech recognition for model compression [16,17,18], domain adaptation [19,20,21,22,23], and transferring knowledge from full-context to streaming scenarios [24,25]. These methods have applied KD both at the sequence level [17,18] and at the frame level [16,23]. The early works on sequence-level KD [26,24] used a two-step procedure.…”
Section: Related Work (mentioning)
Confidence: 99%
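
As a point of contrast with the sequence-level, two-step procedure mentioned here and in the next excerpt, frame-level KD is typically a per-frame divergence between teacher and student posteriors. The sketch below is a generic, textbook version under assumed shapes and names, not code from any of the cited works.

    import torch
    import torch.nn.functional as F

    def frame_level_kd_loss(student_logits, teacher_logits, temperature=2.0):
        """
        student_logits, teacher_logits: [batch, frames, vocab] raw (pre-softmax) outputs.
        A temperature > 1 softens both distributions before taking the divergence.
        """
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        # KL(teacher || student), summed over frames and vocabulary and averaged
        # over the batch; the temperature**2 factor keeps the gradient scale
        # comparable across temperature settings.
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

Sequence-level KD, by contrast, usually first decodes the teacher to obtain hypotheses and then trains the student on those outputs, which is the two-step procedure the excerpt refers to; per the next excerpt, the single-step co-distillation of [18] avoids that intermediate decoding pass.
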
“…The early works on sequence-level KD [26,24] used a two-step procedure. However, a recently proposed method by Panchapagesan et al. [18] allows for single-step co-distillation in RNN-T models. Yu et al. [25] used this loss function to train encoder modules capable of working in both streaming and full-context speech recognition scenarios.…”
Section: Related Work (mentioning)
Confidence: 99%
“…Hard-target distillation was also used in the follow-up works [11,13], which further improved the SoTA results on LibriSpeech by combining it with pre-training. More recently, soft-target distillation for RNN-T was explored in [15,16], where the KL divergence between the teacher and student output label distributions is used as the loss function, similar to those used in [4]. However, it was only used for model compression [15] and for streaming ASR models [16].…”
Section: Introduction (mentioning)
Confidence: 99%