Interspeech 2019
DOI: 10.21437/interspeech.2019-1130

End-to-End Multi-Speaker Speech Recognition Using Speaker Embeddings and Transfer Learning

Abstract: This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech. We propose to train an end-to-end system conditioned on speaker embeddings and further improved by transfer learning from clean speech. This proposed framework does not require any parallel non-overlapped speech materials and is independent of the number of speakers. Our experimental results on overlapped speech datasets show that joint conditioning on speaker embeddings and transfer learning si…

Cited by 24 publications (16 citation statements)
References 35 publications
“…TL as an advanced variant of ML has attained great success in various fields, e.g., speech recognition [8,9], text mining [10], computer vision [11,12], and ubiquitous computing [13,14] over the last two decades. The existing TL approaches are categorized into the following three main groups: (1) instance-based [15], (2) model-based [16,17] and (3) feature-based [18,19] approaches.…”
Section: Related Work (mentioning)
confidence: 99%
“…where tr −2 (X) displays the inverse of tr(X) square root. [39], which belongs to two Gaussian kernels, due to the quadratic form of equations (7) and (8), where the derivative of D(W, X_S, X_T) considering W is:…”
Section: Objective Function (mentioning)
confidence: 99%
“…In our experiments, we first explore multi-condition training, which involves pooling training samples from different conditions and training them simultaneously [32]. In multi-condition training, we train one model using pooled CS and monolingual data.…”
Section: Transfer Learning (mentioning)
confidence: 99%
“…Target speech extraction [10] enables the recognition of the target speaker's voice by extracting only his/her voice from the observed mixture. Various approaches have been proposed for speech separation and extraction so far, and even in single-channel setup, they realized drastic improvement on overlapping speech recognition by dealing with interfering speech [5,6,11-13].…”
Section: Introduction (mentioning)
confidence: 99%