Interspeech 2018
DOI: 10.21437/interspeech.2018-1456

ESPnet: End-to-End Speech Processing Toolkit

Abstract: This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architectu…

Cited by 1,075 publications (626 citation statements)
References 31 publications
“…All the proposed end-to-end multi-speaker speech recognition models are implemented with the ESPnet framework [29] using the Pytorch backend. Some basic parts are the same for all the models.…”
Section: Methods
confidence: 99%
“…We use TED-LIUM2 as a second dataset to investigate how the performance evolves when using a larger training set with a higher number of speakers. We use the ESPnet toolkit [30] to implement and investigate our proposed methods. I-vectors are extracted using the Kaldi toolkit [31].…”
Section: Methods
confidence: 99%
“…Audio segmentation is an essential technique for saving computational resources and complexity. ESPnet [30] was used as an E2E-ASR toolkit through the experiments. The following four methods were compared:…”
Section: Methods
confidence: 99%