Interspeech 2022
DOI: 10.21437/interspeech.2022-10340
Pruned RNN-T for fast, memory-efficient ASR training

Cited by 21 publications (5 citation statements)
“…This needs to allocate a large amount of memory on graphics processing units (GPUs) or tensor processing units (TPUs). However, as pointed out in [22] and [24], not all alignment paths have high likelihoods, and most of the probability mass is assigned to the paths that are close to a reasonable alignment. As a by-product of the CTC decoder in our proposed system, we can easily get a CTC alignment by aligning the CTC posterior with the ground truth.…”
Section: RNN-T with CTC Guidance
confidence: 98%
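The observation quoted above — most of the probability mass lies on paths close to one reasonable alignment — is what makes pruning the RNN-T lattice safe. Below is a minimal pure-Python sketch of that idea with toy probabilities; the function `rnnt_loglik`, the lattice layout, and the pruning band are illustrative assumptions, not the paper's actual implementation.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def rnnt_loglik(log_emit, log_blank, T, U, band=None):
    """Forward DP over the (t, u) RNN-T lattice.

    log_emit[t][u]:  log P(emit label u+1 at node (t, u)), 0 <= u < U
    log_blank[t][u]: log P(emit blank at node (t, u)),     0 <= u <= U
    band: optional band[t] = (lo, hi) keeping only lo <= u <= hi,
          i.e. pruning nodes far from a plausible alignment.
    """
    NEG = float("-inf")
    alpha = [[NEG] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if band is not None and not (band[t][0] <= u <= band[t][1]):
                continue  # pruned node: leave alpha at -inf
            cands = []
            if t > 0 and alpha[t - 1][u] > NEG:
                cands.append(alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0 and alpha[t][u - 1] > NEG:
                cands.append(alpha[t][u - 1] + log_emit[t][u - 1])
            if cands:
                alpha[t][u] = logsumexp(cands)
    return alpha[T - 1][U] + log_blank[T - 1][U]

# Toy lattice: probability mass concentrated around one alignment
# (emit label 1 near t=1, label 2 near t=2, blank elsewhere).
T, U = 4, 2
p_emit = [[0.05, 0.05], [0.9, 0.05], [0.05, 0.9], [0.05, 0.05]]
p_blank = [[0.9, 0.9, 0.9], [0.05, 0.9, 0.9], [0.9, 0.05, 0.9], [0.9, 0.9, 0.9]]
log_emit = [[math.log(p) for p in row] for row in p_emit]
log_blank = [[math.log(p) for p in row] for row in p_blank]

full = rnnt_loglik(log_emit, log_blank, T, U)
# Restrict u to a narrow band around that alignment:
band = [(0, 1), (0, 2), (0, 2), (1, 2)]
pruned = rnnt_loglik(log_emit, log_blank, T, U, band=band)
```

Because the pruned DP sums over a strict subset of the paths, `pruned` can never exceed `full`; when the band covers the high-probability region, the gap is negligible, which is exactly why the approximation works.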
See 1 more Smart Citation
“…This needs to allocate a large amount of memory on graphic processing units (GPU) or tensor processor units (TPU). However, as pointed out in [22] and [24], not all alignment paths have high likelihoods, and most of the probability mass is assigned to the paths that are close to a reasonable alignment. As a by-product of the CTC decoder in our proposed system, we can easily get a CTC alignment by aligning the CTC posterior with the ground truth.…”
Section: Rnn-t With Ctc Guidancementioning
confidence: 98%
“…c_{t,u} is the probability of emitting the next symbol l_{u+1} while sitting at position (t, u), and φ_{t,u} is the probability of emitting a blank symbol at the same place, whereas in CTC, φ_{t,u} does not depend on u, so it becomes φ_t. This alignment can be used to restrict the set of possible paths when calculating the RNN-T loss, similar to [22, 23] and [24], where the first two works use external alignments obtained from another ASR system, while the latter uses a small RNN-T to obtain the alignment on the fly during training. We validate our method on LibriSpeech (single domain) [25] and SpeechStew (multi-domain) [26] datasets.…”
Section: Introduction
confidence: 99%
“…, U } denotes the index in the label sequence at time t. The negative log of this expression is known as the RNN-T or transducer loss. In practice, to make training more memory-efficient, we often approximate the full sum, for example using the pruned transducer loss [30]. We will denote this loss as Lrnnt for the remainder of this paper.…”
Section: Speech Recognition with Neural Transducers
confidence: 99%
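To see why approximating the full sum matters for memory, here is a back-of-envelope count of the joint-network logits that the full and pruned losses must materialize. The sizes are chosen purely for illustration, not taken from any specific recipe:

```python
# Element counts for the 4-D joint-network output tensor (one float each).
B, T, U, V = 8, 500, 100, 1000  # batch, frames, label length, vocab (assumed)
S = 5                           # symbols kept per frame after pruning (assumed)

full_elems = B * T * (U + 1) * V   # full lattice: B x T x (U+1) x V
pruned_elems = B * T * S * V       # pruned lattice: B x T x S x V
savings = full_elems / pruned_elems
```

With these toy sizes the full joint tensor needs roughly 20x more elements than the pruned one, which is the memory saving the quoted passage alludes to.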
“…Such a synchronization strategy has also recently been proposed for performing word-level diarization using transducers [36]. For both ASR and speaker branches, we use a pruned version of the HAT loss, similar to pruned RNN-T [30].…”
Section: Synchronizing Speaker Labels with ASR Tokens
confidence: 99%
“…Recently, there has been a significant advancement in the development of automatic speech recognition (ASR) technology. Traditional methods based on Hidden Markov Models (HMM) [1,2] have been replaced by deep learning based techniques such as Connectionist Temporal Classification (CTC) [3,4], Attention-based Encoder-Decoder (AED) [5,6,7], and Neural Transducer [8,9].…”
Section: Introduction
confidence: 99%