2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383615

ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration

Cited by 70 publications (28 citation statements)
References 38 publications

“…With the exception of GPT-3 and ERNIE 3.0, most PTMs in the literature can be downloaded from hubs such as PaddleHub, HuggingFaceHub, and Fairseq (Ott et al. 2019). Historically, many of these hubs started with models for natural language processing, though they are beginning to expand into other fields such as speech (SpeechBrain (Ravanelli et al. 2021) and ESPnet (Watanabe et al. 2018; Hayashi et al. 2020; Inaguma et al. 2020; Li et al. 2021)) and vision (ViT (Dosovitskiy et al. 2021; Tolstikhin et al. 2021; Steiner et al. 2021; Chen, Hsieh, and Gong 2021)).…”
Section: Pre-training, Fine-tuning and Inference
confidence: 99%
“…We adopt the 5th channel (CH5) as the reference channel for both training and evaluation. To evaluate the performance of frontend models on the real data, we adopt an E2E ASR model pretrained on the CHiME-4 dataset, which was also used in Section 4.1 of [24]. For the joint training of frontend and backend, we optionally include an additional dataset from the Wall Street Journal (WSJ) corpus [25] for training, which consists of 37416 clean speech samples.…”
Section: Experiments, 3.1 Experimental Setup
confidence: 99%
“…All our models are built on the ESPnet toolkit [24,27]. The MC-Conv-TasNet model uses a Conv1D layer with 5 input channels and 256 output channels for the multi-channel encoder, with a kernel size of 20 and stride of 10.…”
Section: Experiments, 3.1 Experimental Setup
confidence: 99%
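The encoder configuration quoted above maps directly onto a single strided 1-D convolution. Below is a minimal PyTorch sketch of such a multi-channel encoder; the class name MCEncoder, the ReLU nonlinearity, and the shape check are illustrative assumptions, not ESPnet's actual implementation.

import torch
import torch.nn as nn

class MCEncoder(nn.Module):
    # Hypothetical multi-channel waveform encoder mirroring the quoted
    # MC-Conv-TasNet setup: Conv1d with 5 input channels, 256 output
    # channels, kernel size 20, stride 10.
    def __init__(self, in_channels=5, n_filters=256, kernel_size=20, stride=10):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, n_filters, kernel_size, stride=stride)

    def forward(self, x):
        # x: (batch, channels=5, samples), a raw multi-channel waveform
        return torch.relu(self.conv(x))  # (batch, 256, frames)

# Shape check: 1 s of 16 kHz 5-channel audio -> 1599 latent frames,
# since floor((16000 - 20) / 10) + 1 = 1599.
enc = MCEncoder()
print(enc(torch.randn(1, 5, 16000)).shape)  # torch.Size([1, 256, 1599])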
“…Thus, we can easily evaluate the E2E-ASR performance of pretrained SSLRs available in S3PRL using current state-of-the-art (SOTA) neural network models, such as Transformers [4,5] and Conformers [6]. We can also easily evaluate the SSLRs in other downstream tasks, including speech translation (ST) [30] and speech enhancement (SE) [31]. An open question remains about the generalization ability of these SSLRs, given that most of them were trained and tested mainly on LibriSpeech [7,32].…”
Section: Introduction
confidence: 99%