2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461893
End-to-End Multi-Speaker Speech Recognition

Cited by 71 publications (51 citation statements)
References 5 publications
“…Similar to the single-channel model, the permutation order of the reference sequences R j is determined by (7). The whole MIMO-Speech model is optimized only with ASR loss as in (8).…”
Section: Multi-channel Multi-speaker ASR
confidence: 99%
“…In single-channel speech separation, various methods have been proposed, among which deep clustering (DPCL) based methods [2] and permutation invariant training (PIT) based methods [3] are the dominant ones. For ASR, methods combining separation with single-speaker ASR, as well as methods that skip the explicit separation step and directly build a multi-speaker speech recognition system, have been proposed, using either the hybrid ASR framework [4-6] or the end-to-end ASR framework [7-9]. In the multi-channel condition, the spatial information derived from inter-channel differences can help distinguish between speech sources from different directions, which makes the problem easier to solve.…”
Section: Introduction
confidence: 99%
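The DPCL objective mentioned in the statement above trains an embedding for each time-frequency bin so that bins dominated by the same speaker cluster together. A minimal sketch of the standard affinity-based loss, ||VVᵀ − YYᵀ||²_F, is shown below; the function name and toy data are illustrative, not taken from any cited implementation.

```python
import numpy as np

def dpcl_loss(V, Y):
    """Deep clustering (DPCL) objective: || V V^T - Y Y^T ||_F^2.

    V: (N, D) embeddings, one per time-frequency bin.
    Y: (N, C) one-hot ideal speaker assignments.
    The loss compares the estimated bin-to-bin affinity matrix
    (V V^T) against the ideal one (Y Y^T)."""
    return float(np.sum((V @ V.T - Y @ Y.T) ** 2))

# Toy check: embeddings that exactly match the speaker indicators
# give zero loss, since both affinity matrices coincide.
Y = np.array([[1., 0.],
              [1., 0.],
              [0., 1.]])
print(dpcl_loss(Y, Y))  # -> 0.0
```

In practice the embeddings come from a neural network and clustering (e.g. k-means) over V recovers the speaker masks at test time.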
“…Other works have already studied the effectiveness of frequency-domain source separation techniques as a front-end for ASR. DPCL and PIT have been used efficiently for this purpose, and it was shown that joint retraining for fine-tuning can improve performance [7,8,10]. E2E systems for single-channel multi-speaker ASR have been proposed that no longer consist of individual parts dedicated to source separation and speech recognition, but combine these functionalities into one large monolithic neural network.…”
Section: Relation To Prior Work
confidence: 99%
“…Based on these source separation techniques, multi-speaker ASR systems have been constructed. DPCL and PIT have been used as frequency-domain source separation front-ends for a state-of-the-art single-speaker ASR system and extended to jointly trained E2E or hybrid systems [7,8,9,10]. They showed that joint (re-)training can improve the performance of these models over a simple cascade system.…”
Section: Introduction
confidence: 99%
“…In one line of research using ASR-based training criteria, multi-speaker ASR based on permutation invariant training (PIT) has been proposed [4, 13-16]. With PIT, the label-permutation problem is solved by considering all possible permutations when calculating the loss function [17].…”
Section: Introduction
confidence: 99%
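The PIT criterion described in the statement above can be sketched in a few lines: evaluate the per-speaker loss under every assignment of system outputs to reference labels and keep the minimum. This is a minimal illustration with a generic per-pair loss, not the cited papers' implementation.

```python
import itertools
import numpy as np

def pit_loss(estimates, references, loss_fn):
    """Permutation invariant training loss: try every assignment
    of estimates to references and return the smallest total loss,
    which resolves the label-permutation ambiguity."""
    n = len(references)
    best = None
    for perm in itertools.permutations(range(n)):
        total = sum(loss_fn(estimates[i], references[p])
                    for i, p in enumerate(perm))
        if best is None or total < best:
            best = total
    return best

# Toy example: mean-squared error as the per-speaker loss, with
# the estimates listed in the opposite order to the references.
mse = lambda a, b: float(np.mean((a - b) ** 2))
refs = [np.zeros(4), np.ones(4)]
ests = [np.ones(4), np.zeros(4)]  # swapped order
print(pit_loss(ests, refs, mse))  # -> 0.0 under the swapped permutation
```

Since all n! permutations are enumerated, this exact form is practical only for small speaker counts (two or three), which matches the settings studied in the cited work.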