Recent Developments on ESPnet Toolkit Boosted by Conformer

Guo, Pengcheng; Boyer, F.; Chang, Xuankai; Hayashi, Tomoki; Higuchi, Yuki; Inaguma, Hirofumi; Kamo, Naoyuki; Li, Chenda; Garcia‐Romero, Daniel; Shi, Jiatong; Jing, Shi; Watanabe, Shinji; Wei, Kun; Zhang, Wangyou; Zhang, Yuekai

doi:10.1109/icassp39728.2021.9414858

Cited by 142 publications

(67 citation statements)

References 25 publications


“…Warmup steps were set to 25k, and the learning rate factor was 1.0. Regularization hyperparameters, such as the dropout rate and label-smoothing weight, followed the same setup as in [28]. For evaluation, a final model was obtained by averaging model parameters over 10 checkpoints with the best validation performance.…”

Section: Experimental Conditions (mentioning)

confidence: 99%
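The checkpoint averaging described in the quoted setup can be illustrated with a minimal sketch. It assumes each checkpoint is a plain dictionary mapping parameter names to lists of floats and that a higher validation score is better; the names `select_best` and `average_checkpoints` are hypothetical and are not taken from the cited paper or from ESPnet.

```python
from typing import Dict, List, Tuple

# Assumption: a checkpoint is a mapping from parameter name to a flat list of values.
Checkpoint = Dict[str, List[float]]


def select_best(scored: List[Tuple[float, Checkpoint]], k: int = 10) -> List[Checkpoint]:
    """Keep the k checkpoints with the highest validation score (assumed: higher is better)."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [ckpt for _, ckpt in ranked[:k]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average every parameter element-wise across the given checkpoints."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # zip(*...) groups the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged
```

With real models the same idea is usually applied to framework-specific parameter tensors rather than Python lists.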

Non-Autoregressive ASR with Self-Conditioned Folded Encoders

Komatsu

2022

Preprint


This paper proposes CTC-based non-autoregressive ASR with self-conditioned folded encoders. The proposed method realizes non-autoregressive ASR with fewer parameters by folding the conventional stack of encoders into only two blocks; base encoders and folded encoders. The base encoders convert the input audio features into a neural representation suitable for recognition. This is followed by the folded encoders applied repeatedly for further refinement. Applying the CTC loss to the outputs of all encoders enforces the consistency of the input-output relationship. Thus, folded encoders learn to perform the same operations as an encoder with deeper distinct layers. In experiments, we investigate how to set the number of layers and the number of iterations for the base and folded encoders. The results show that the proposed method achieves a performance comparable to that of the conventional method using only 38% as many parameters. Furthermore, it outperforms the conventional method when increasing the number of iterations.
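The folding idea in this abstract can be sketched as a forward pass in which the base layers run once and the same folded layers are reused for several iterations. This is an illustrative sketch only: layers are modeled as plain Python callables, the function names are hypothetical, and the self-conditioning on intermediate CTC predictions is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[float]
Layer = Callable[[Features], Features]


def folded_forward(
    x: Features,
    base_layers: Sequence[Layer],
    folded_layers: Sequence[Layer],
    num_iterations: int,
) -> Tuple[Features, List[Features]]:
    """Run the base layers once, then reuse the folded layers num_iterations times."""
    h = x
    for layer in base_layers:
        h = layer(h)
    # Outputs collected here are where a CTC loss would be attached during training.
    outputs = [h]
    for _ in range(num_iterations):
        for layer in folded_layers:  # same layer objects, hence shared parameters
            h = layer(h)
        outputs.append(h)
    return h, outputs


def effective_depth(num_base: int, num_folded: int, num_iterations: int) -> int:
    """Number of layer applications performed in one forward pass."""
    return num_base + num_folded * num_iterations


def distinct_layers(num_base: int, num_folded: int) -> int:
    """Number of layers that actually hold parameters."""
    return num_base + num_folded
```

The gap between `effective_depth` and `distinct_layers` is what allows a deep computation with fewer parameters.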


“…While the MHSA learns the global context, the CONV module efficiently captures the local correlations synchronously. Since the Conformer encoder has shown consistent improvement over a wide range of end-to-end speech processing applications [7], we expect it to compensate for the modeling capacity of CTC and improve the system performance.…”

Section: Conformer Encoder (mentioning)

confidence: 99%

“…There is no doubt that a pure CTC-based encoder network can hardly model different speakers' speech simultaneously. When applying the conditional speaker chain based method, both model (7) and model (8) are better than the PIT model. By combining the single and multi-speaker mixture speech, model (8) shows a significant improvement, with a WER of 29.5% on the WSJ0-2mix test set. For our conditional Conformer-CTC model (9), we explore two types of conditional features, including the "hard" CTC alignments and "soft" latent features after EncoderRec.…”

Section: Models (mentioning)

confidence: 99%

“…End-to-end architectures have demonstrated their effectiveness and become the dominant models across various sequence-to-sequence tasks, like neural machine translation (NMT) [1,2] and automatic speech recognition (ASR) [3,4,5,6,7]. However, most of these models follow an autoregressive (AR) strategy, which predicts a target token conditioned on both previously generated tokens and the source input sequence.…”

Section: Introduction (mentioning)

confidence: 99%


Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

et al. 2021

Self Cite


Non-autoregressive (NAR) models have achieved a large reduction in inference computation and results comparable with autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches on sequence-to-multi-sequence problems, like multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one-by-one using both the input mixture speech and previously estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total number of inference steps is restricted to the number of mixed speakers. In addition, we adopt the Conformer and incorporate an intermediate CTC loss to improve the performance. Experiments on WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase in latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including data with variable numbers of speakers, our model can even outperform the PIT-Conformer AR model with only 1/7 of the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets. All of our code is publicly available at https://github.com/pengchengguo/espnet/tree/conditionalmultispk.
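The one-speaker-per-step inference described in this abstract can be sketched as a short loop. This is only an illustration under stated assumptions: `step` is a hypothetical callable standing in for the NAR CTC encoder, and the feature and token types are simplified placeholders rather than the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

Features = List[float]
Tokens = List[str]
# Hypothetical step: (mixture, previous condition) -> (tokens for one speaker, new condition)
Step = Callable[[Features, Optional[Features]], Tuple[Tokens, Features]]


def conditional_chain_decode(mixture: Features, num_speakers: int, step: Step) -> List[Tokens]:
    """Infer one speaker per step, feeding each step the condition from the previous one."""
    condition: Optional[Features] = None  # no conditional features before the first speaker
    hypotheses: List[Tokens] = []
    for _ in range(num_speakers):
        # Each step sees the full mixture plus the previously estimated speaker features.
        tokens, condition = step(mixture, condition)
        hypotheses.append(tokens)
    return hypotheses
```

The loop runs exactly `num_speakers` times, which mirrors the abstract's point that the number of inference steps is bounded by the number of mixed speakers.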


“…Transformer [3] successfully reduces CERs by replacing the BLSTM on Japanese ASR tasks [4]. Its successor with several modifications for ASR, Conformer [5], further decreases CERs on the Japanese tasks as well as other languages [6].…”

Section: Introduction (mentioning)

confidence: 99%

A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition

et al. 2021


End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR), especially for Japanese, since word-based tokenization of Japanese is not trivial and E2E modeling is able to model character sequences directly. This paper focuses on the latest E2E modeling techniques and investigates their performance on character-based Japanese ASR by conducting comparative experiments. The results are analyzed and discussed in order to understand the relative advantages of long short-term memory (LSTM) and Conformer models in combination with connectionist temporal classification, transducer, and attention-based loss functions. Furthermore, the paper investigates the effectiveness of recent training techniques such as data augmentation (SpecAugment), variational noise injection, and exponential moving average. The best configuration found in the paper achieved state-of-the-art character error rates of 4.1%, 3.2%, and 3.5% for the Corpus of Spontaneous Japanese (CSJ) eval1, eval2, and eval3 tasks, respectively. The system is also shown to be computationally efficient thanks to the efficiency of Conformer transducers.
