ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414198
Improved Mask-CTC for Non-Autoregressive End-to-End ASR

Abstract: For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while relieving the requirement of computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (…)

Cited by 50 publications (27 citation statements)
References 31 publications
“…A key factor in improving the performance of CTC-based models is how to handle dependencies between tokens in a non-autoregressive manner. Approaches based on iterative refinement of the output tokens by token decoders are well known, and several methods have been proposed [11,12,13,14]. Mask-CTC [12] is a model that feeds the CTC output to a decoder and refines the low-confidence tokens conditioned on the high-confidence tokens, and Align-Refine [13] is a model that feeds the latent alignment to the decoder and performs refinement in the alignment space.…”
Section: Introduction
confidence: 99%
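As a rough illustration of the Mask-CTC refinement described in the statement above, the sketch below masks low-confidence tokens in a CTC output and lets a decoder re-predict only those positions, conditioned on the high-confidence tokens. The `MASK_ID`, `CONF_THRESHOLD`, and the decoder call signature are illustrative assumptions, not the paper's implementation.

```python
import torch

MASK_ID = 0           # hypothetical id of the <mask> token
CONF_THRESHOLD = 0.9  # assumed confidence threshold for keeping a CTC token


def mask_low_confidence(ctc_tokens: torch.Tensor,
                        ctc_confidences: torch.Tensor) -> torch.Tensor:
    """Replace CTC output tokens whose confidence falls below the threshold
    with the <mask> token."""
    keep = ctc_confidences >= CONF_THRESHOLD
    return torch.where(keep, ctc_tokens, torch.full_like(ctc_tokens, MASK_ID))


def refine_once(decoder, encoder_out: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """One refinement pass: re-predict only the masked positions, conditioned
    on the unmasked (high-confidence) tokens and the encoder output."""
    logits = decoder(tokens, encoder_out)          # (T_out, vocab); assumed call signature
    predicted = logits.argmax(dim=-1)
    masked = tokens == MASK_ID
    return torch.where(masked, predicted, tokens)  # confident tokens stay fixed
```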
“…Inspired by the success of NAR models in NMT, several NAR methods were also proposed to reach the performance of AR models on ASR [16,17,18,19,20,21,22]. Since CTC learns a frame-wise latent alignment between the input speech and output tokens and predicts the target sequence based on a strong conditional independence assumption [23], it can be viewed as an early-stage realization of NAR ASR models.…”
Section: Introduction
confidence: 99%
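A minimal sketch of the frame-wise, conditionally independent prediction that CTC performs, assuming greedy decoding: each frame is decoded from its own posterior, then repeated labels are collapsed and blanks removed. The blank id and input format are assumptions for illustration.

```python
from itertools import groupby

BLANK_ID = 0  # assumed id of the CTC blank symbol


def ctc_greedy_decode(frame_posteriors):
    """frame_posteriors: a list with one probability vector per input frame.

    Each frame is decoded independently of the others (CTC's conditional
    independence assumption); the resulting alignment is then collapsed by
    merging repeated labels and dropping blanks.
    """
    alignment = [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]
    collapsed = [label for label, _ in groupby(alignment)]
    return [label for label in collapsed if label != BLANK_ID]
```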
“…In [18], Imputer was proposed to iteratively generate a new CTC alignment based on mask prediction. Besides, Mask-CTC [17,20] and Align-Refine [21] aimed to refine a token-level CTC output or latent alignments with mask prediction. In [19], Tian et al. proposed to use the estimated CTC spikes to predict the length of the target sequence and to adopt the encoder states as the input of the decoder.…”
Section: Introduction
confidence: 99%
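One way to read the length-prediction idea attributed to [19] is to count the non-blank spikes in the collapsed greedy CTC alignment and use that count as the target-sequence length handed to the decoder. The sketch below is a hedged illustration of that reading, not the cited paper's actual estimator.

```python
from itertools import groupby

BLANK_ID = 0  # assumed id of the CTC blank symbol


def estimate_target_length(frame_posteriors) -> int:
    """Estimate the output length as the number of non-blank spikes in the
    collapsed greedy CTC alignment (a crude proxy, for illustration only)."""
    alignment = [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]
    collapsed = [label for label, _ in groupby(alignment)]
    return sum(1 for label in collapsed if label != BLANK_ID)
```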
“…To accelerate the inference, non-autoregressive transformers (NAT) were proposed for the parallel generation of the output sequence. The idea is widely adopted in neural machine translation (NMT) [4][5][6], automatic speech recognition (ASR) [7][8][9][10][11][12][13][14][15][16][17][18], text-to-speech (TTS) [19,20] and speech translation [21].…”
Section: Introduction
confidence: 99%
“…Essentially, autoregressive models are also iterative-based, since they use a left-to-right generation order and take N iterations to generate a sequence of length N. Hence, the idea of iterative NAT is to adopt a different generation order with fewer than N iterations to accelerate inference. Chen et al. regarded the transformer decoder as a masked language model that first generates tokens with high confidence [7], while Higuchi et al. applied the same idea but based on the connectionist temporal classification (CTC) output [11,16]. In addition, Fujita et al. used the idea of the insertion transformer from NMT to generate the output sequence in an arbitrary order [12].…”
Section: Introduction
confidence: 99%
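To make the iteration-count contrast concrete, the loop below sketches mask-predict style decoding in a fixed number of passes (much smaller than the output length N): every pass re-predicts all positions, then re-masks the least confident ones for the next pass, keeping more tokens fixed as iterations progress. The decoder call signature, `MASK_ID`, and the linear masking schedule are assumptions for illustration.

```python
import torch

MASK_ID = 0         # hypothetical id of the <mask> token
NUM_ITERATIONS = 4  # K, assumed to be much smaller than the output length N


def mask_predict_decode(decoder, encoder_out: torch.Tensor, length: int) -> torch.Tensor:
    """Generate a sequence of `length` tokens in NUM_ITERATIONS parallel passes
    instead of `length` left-to-right autoregressive steps."""
    tokens = torch.full((length,), MASK_ID, dtype=torch.long)
    for it in range(NUM_ITERATIONS):
        probs = decoder(tokens, encoder_out).softmax(dim=-1)  # (length, vocab); assumed
        confidences, tokens = probs.max(dim=-1)
        # Re-mask the least confident positions for the next pass; the number
        # of masked positions shrinks linearly to zero over the iterations.
        num_to_mask = length * (NUM_ITERATIONS - 1 - it) // NUM_ITERATIONS
        if num_to_mask > 0:
            tokens[confidences.argsort()[:num_to_mask]] = MASK_ID
    return tokens
```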