ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414858
Recent Developments on ESPnet Toolkit Boosted by Conformer

Abstract: In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer, a convolution-augmented Transformer. This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Co…

Cited by 142 publications (67 citation statements)
References 25 publications
“…Warmup steps were set to 25k, and the learning-rate factor was 1.0. Regularization hyperparameters, such as the dropout rate and label-smoothing weight, followed the same setup as in [28]. For evaluation, a final model was obtained by averaging model parameters over the 10 checkpoints with the best validation performance.…”
Section: Experimental Conditions
confidence: 99%
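The excerpt above combines two standard recipes: a Noam-style warmup learning-rate schedule (where the "factor" scales the whole curve) and checkpoint averaging over the best validation checkpoints. A minimal sketch, assuming a `d_model` of 256 and representing checkpoints as plain parameter dictionaries (both are illustrative assumptions, not the paper's exact setup):

```python
def noam_lr(step, d_model=256, warmup=25000, factor=1.0):
    """Noam schedule: linear warmup for `warmup` steps, then
    inverse-square-root decay; `factor` scales the whole curve."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def average_checkpoints(checkpoints):
    """Average parameter dicts element-wise over the selected
    checkpoints (e.g. the 10 with the best validation score)."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

# The rate peaks at the warmup boundary, then decays.
assert noam_lr(12500) < noam_lr(25000) > noam_lr(100000)
print(average_checkpoints([{"w": 1.0}, {"w": 3.0}]))  # {'w': 2.0}
```

Averaging the last or best few checkpoints is cheap at inference time and typically smooths out checkpoint-to-checkpoint noise in the final model.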
“…While the MHSA learns the global context, the CONV module efficiently captures local correlations at the same time. Since the Conformer encoder has shown consistent improvements over a wide range of end-to-end speech processing applications [7], we expect it to compensate for the modeling capacity of CTC and improve system performance.…”
Section: Conformer Encoder
confidence: 99%
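The Conformer block arranges these components in a macaron structure: a half-step feed-forward, self-attention for global context, a convolution module for local context, and a second half-step feed-forward, each wrapped in a residual connection. A structural sketch with identity-like stubs in place of the learned sub-modules (the stubs are assumptions purely to show the wiring):

```python
import numpy as np

# Stub sub-modules: real Conformer layers are learned; these only
# illustrate the residual wiring of the block.
def ffn(x):  return 0.1 * x   # position-wise feed-forward (stub)
def mhsa(x): return 0.1 * x   # multi-head self-attention: global context (stub)
def conv(x): return 0.1 * x   # depthwise conv module: local correlations (stub)

def layernorm(x):
    mu, sd = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sd + 1e-5)

def conformer_block(x):
    x = x + 0.5 * ffn(x)   # first half-step feed-forward
    x = x + mhsa(x)        # global context via self-attention
    x = x + conv(x)        # local context via convolution
    x = x + 0.5 * ffn(x)   # second half-step feed-forward
    return layernorm(x)

x = np.random.randn(8, 256)  # (frames, features)
assert conformer_block(x).shape == x.shape
```

The half-step (0.5-scaled) feed-forward pair is the "macaron" design the Conformer paper adopts; the convolution module sits between attention and the second feed-forward.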
“…There is no doubt that a pure CTC-based encoder network can hardly model different speakers' speech simultaneously. When applying the conditional speaker-chain-based method, both model (7) and model (8) outperform the PIT model. By combining the single- and multi-speaker mixture speech, model (8) shows a significant improvement, with a WER of 29.5% on the WSJ0-2mix test set. For our conditional Conformer-CTC model (9), we explore two types of conditional features: the "hard" CTC alignments and the "soft" latent features after EncoderRec.…”
Section: Models
confidence: 99%
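The "hard" CTC alignments mentioned here are frame-level label sequences that reduce to the output transcription by merging consecutive repeats and then dropping blanks. A minimal sketch of that collapse rule (the choice of `0` as the blank symbol is an assumption):

```python
def ctc_collapse(alignment, blank=0):
    """Collapse a frame-level CTC alignment: merge consecutive
    repeated symbols, then drop the blank symbol."""
    out, prev = [], None
    for sym in alignment:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# An 8-frame alignment with blank=0 collapses to a 3-label sequence.
print(ctc_collapse([0, 3, 3, 0, 0, 5, 5, 3]))  # [3, 5, 3]
```

Note that a blank between two identical labels (e.g. `[1, 0, 1]`) preserves both, which is how CTC represents repeated characters.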
“…Transformer [3] successfully reduces CERs by replacing the BLSTM on Japanese ASR tasks [4]. Its successor with several modifications for ASR, Conformer [5], further decreases CERs on Japanese tasks as well as in other languages [6].…”
Section: Introduction
confidence: 99%
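The CER figures these comparisons rest on are the character-level edit distance between hypothesis and reference, normalized by reference length. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(ref, hyp):
    """Character error rate: edit operations per reference character."""
    return edit_distance(ref, hyp) / len(ref)

print(cer("speech", "spech"))  # one deletion over 6 chars -> ~0.167
```

In practice WER is computed the same way over word tokens instead of characters.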