2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461893
End-to-End Multi-Speaker Speech Recognition

Cited by 71 publications (51 citation statements)
References 5 publications
“…Similar to the single-channel model, the permutation order of the reference sequences R j is determined by (7). The whole MIMO-Speech model is optimized only with ASR loss as in (8).…”
Section: Multi-channel Multi-speaker ASR
confidence: 99%
“…In single-channel speech separation, various methods have been proposed, among which deep clustering (DPCL) based methods [2] and permutation invariant training (PIT) based methods [3] are the dominant ones. For ASR, methods combining separation with single-speaker ASR, as well as methods that skip the explicit separation step and directly build a multi-speaker speech recognition system, have been proposed, using either the hybrid ASR framework [4-6] or the end-to-end ASR framework [7-9]. In the multi-channel condition, the spatial information derived from inter-channel differences can help distinguish between speech sources from different directions, which makes the problem easier to solve.…”
Section: Introduction
confidence: 99%
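The DPCL objective mentioned in the statement above trains an embedding for each time-frequency bin so that bins dominated by the same speaker cluster together. A minimal sketch of the standard affinity-based loss, ||VVᵀ − YYᵀ||²_F, is shown below; the function name and toy data are illustrative, not taken from any cited implementation.

```python
import numpy as np

def dpcl_loss(V, Y):
    """Deep clustering (DPCL) objective: || V V^T - Y Y^T ||_F^2.

    V: (N, D) embeddings, one per time-frequency bin.
    Y: (N, C) one-hot ideal speaker assignments.
    The loss compares the estimated bin-to-bin affinity matrix
    (V V^T) against the ideal one (Y Y^T)."""
    return float(np.sum((V @ V.T - Y @ Y.T) ** 2))

# Toy check: embeddings that exactly match the speaker indicators
# give zero loss, since both affinity matrices coincide.
Y = np.array([[1., 0.],
              [1., 0.],
              [0., 1.]])
print(dpcl_loss(Y, Y))  # -> 0.0
```

In practice the embeddings come from a neural network and clustering (e.g. k-means) over V recovers the speaker masks at test time.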
“…Other works have already studied the effectiveness of frequency-domain source separation techniques as a front-end for ASR. DPCL and PIT have been used efficiently for this purpose, and it was shown that joint retraining for fine-tuning can improve performance [7,8,10]. E2E systems for single-channel multi-speaker ASR have been proposed that no longer consist of individual parts dedicated to source separation and speech recognition, but combine these functionalities into one large monolithic neural network.…”
Section: Relation To Prior Work
confidence: 99%
“…Based on these source separation techniques, multi-speaker ASR systems have been constructed. DPCL and PIT have been used as frequency-domain source separation front-ends for a state-of-the-art single-speaker ASR system and extended to jointly trained E2E or hybrid systems [7,8,9,10]. They showed that joint (re-)training can improve the performance of these models over a simple cascade system.…”
Section: Introduction
confidence: 99%
“…In one line of research using ASR-based training criteria, multi-speaker ASR based on permutation invariant training (PIT) has been proposed [4, 13-16]. With PIT, the label-permutation problem is solved by considering all possible permutations when calculating the loss function [17].…”
Section: Introduction
confidence: 99%
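The PIT criterion described in the statement above can be sketched in a few lines: evaluate the per-speaker loss under every assignment of system outputs to reference labels and keep the minimum. This is a minimal illustration with a generic per-pair loss, not the cited papers' implementation.

```python
import itertools
import numpy as np

def pit_loss(estimates, references, loss_fn):
    """Permutation invariant training loss: try every assignment
    of estimates to references and return the smallest total loss,
    which resolves the label-permutation ambiguity."""
    n = len(references)
    best = None
    for perm in itertools.permutations(range(n)):
        total = sum(loss_fn(estimates[i], references[p])
                    for i, p in enumerate(perm))
        if best is None or total < best:
            best = total
    return best

# Toy example: mean-squared error as the per-speaker loss, with
# the estimates listed in the opposite order to the references.
mse = lambda a, b: float(np.mean((a - b) ** 2))
refs = [np.zeros(4), np.ones(4)]
ests = [np.ones(4), np.zeros(4)]  # swapped order
print(pit_loss(ests, refs, mse))  # -> 0.0 under the swapped permutation
```

Since all n! permutations are enumerated, this exact form is practical only for small speaker counts (two or three), which matches the settings studied in the cited work.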