Interspeech 2022
DOI: 10.21437/interspeech.2022-10340
Pruned RNN-T for fast, memory-efficient ASR training

Cited by 21 publications (5 citation statements)
“…This needs to allocate a large amount of memory on graphics processing units (GPUs) or tensor processing units (TPUs). However, as pointed out in [22] and [24], not all alignment paths have high likelihoods, and most of the probability mass is assigned to the paths that are close to a reasonable alignment. As a by-product of the CTC decoder in our proposed system, we can easily get a CTC alignment by aligning the CTC posterior with the ground truth.…”
Section: RNN-T with CTC Guidance
confidence: 98%
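The observation quoted above — most of the probability mass lies on paths close to one reasonable alignment — is what makes pruning the RNN-T lattice safe. Below is a minimal pure-Python sketch of that idea with toy probabilities; the function `rnnt_loglik`, the lattice layout, and the pruning band are illustrative assumptions, not the paper's actual implementation.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def rnnt_loglik(log_emit, log_blank, T, U, band=None):
    """Forward DP over the (t, u) RNN-T lattice.

    log_emit[t][u]:  log P(emit label u+1 at node (t, u)), 0 <= u < U
    log_blank[t][u]: log P(emit blank at node (t, u)),     0 <= u <= U
    band: optional band[t] = (lo, hi) keeping only lo <= u <= hi,
          i.e. pruning nodes far from a plausible alignment.
    """
    NEG = float("-inf")
    alpha = [[NEG] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if band is not None and not (band[t][0] <= u <= band[t][1]):
                continue  # pruned node: leave alpha at -inf
            cands = []
            if t > 0 and alpha[t - 1][u] > NEG:
                cands.append(alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0 and alpha[t][u - 1] > NEG:
                cands.append(alpha[t][u - 1] + log_emit[t][u - 1])
            if cands:
                alpha[t][u] = logsumexp(cands)
    return alpha[T - 1][U] + log_blank[T - 1][U]

# Toy lattice: probability mass concentrated around one alignment
# (emit label 1 near t=1, label 2 near t=2, blank elsewhere).
T, U = 4, 2
p_emit = [[0.05, 0.05], [0.9, 0.05], [0.05, 0.9], [0.05, 0.05]]
p_blank = [[0.9, 0.9, 0.9], [0.05, 0.9, 0.9], [0.9, 0.05, 0.9], [0.9, 0.9, 0.9]]
log_emit = [[math.log(p) for p in row] for row in p_emit]
log_blank = [[math.log(p) for p in row] for row in p_blank]

full = rnnt_loglik(log_emit, log_blank, T, U)
# Restrict u to a narrow band around that alignment:
band = [(0, 1), (0, 2), (0, 2), (1, 2)]
pruned = rnnt_loglik(log_emit, log_blank, T, U, band=band)
```

Because the pruned DP sums over a strict subset of the paths, `pruned` can never exceed `full`; when the band covers the high-probability region, the gap is negligible, which is exactly why the approximation works.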
See 1 more Smart Citation
“…This needs to allocate a large amount of memory on graphic processing units (GPU) or tensor processor units (TPU). However, as pointed out in [22] and [24], not all alignment paths have high likelihoods, and most of the probability mass is assigned to the paths that are close to a reasonable alignment. As a by-product of the CTC decoder in our proposed system, we can easily get a CTC alignment by aligning the CTC posterior with the ground truth.…”
Section: Rnn-t With Ctc Guidancementioning
confidence: 98%
“…c_{t,u} is the probability of emitting the next symbol l_{u+1} while sitting at position (t, u), and φ_{t,u} is the probability of emitting a blank symbol at the same place, whereas in CTC, φ_{t,u} does not depend on u, so it becomes φ_t. This alignment can be used to restrict the set of possible paths when calculating the RNN-T loss, similar to [22, 23] and [24], where the first two works use external alignments obtained from another ASR system, while the latter uses a small RNN-T to obtain the alignment on the fly during training. We validate our method on LibriSpeech (single domain) [25] and SpeechStew (multi-domain) [26] datasets.…”
Section: Introduction
confidence: 99%
“…, U } denotes the index in the label sequence at time t. The negative log of this expression is known as the RNN-T or transducer loss. In practice, to make training more memory-efficient, we often approximate the full sum, for example using the pruned transducer loss [30]. We will denote this loss as Lrnnt for the remainder of this paper.…”
Section: Speech Recognition with Neural Transducers
confidence: 99%
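To see why approximating the full sum matters for memory, here is a back-of-envelope count of the joint-network logits that the full and pruned losses must materialize. The sizes are chosen purely for illustration, not taken from any specific recipe:

```python
# Element counts for the 4-D joint-network output tensor (one float each).
B, T, U, V = 8, 500, 100, 1000  # batch, frames, label length, vocab (assumed)
S = 5                           # symbols kept per frame after pruning (assumed)

full_elems = B * T * (U + 1) * V   # full lattice: B x T x (U+1) x V
pruned_elems = B * T * S * V       # pruned lattice: B x T x S x V
savings = full_elems / pruned_elems
```

With these toy sizes the full joint tensor needs roughly 20x more elements than the pruned one, which is the memory saving the quoted passage alludes to.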
“…Such a synchronization strategy has also recently been proposed for performing word-level diarization using transducers [36]. For both ASR and speaker branches, we use a pruned version of the HAT loss, similar to pruned RNN-T [30].…”
Section: Synchronizing Speaker Labels with ASR Tokens
confidence: 99%
“…Recently, there has been a significant advancement in the development of automatic speech recognition (ASR) technology. Traditional methods based on Hidden Markov Models (HMM) [1,2] have been replaced by deep learning based techniques such as Connectionist Temporal Classification (CTC) [3,4], Attention-based Encoder-Decoder (AED) [5,6,7], and Neural Transducer [8,9].…”
Section: Introduction
confidence: 99%