Interspeech 2021
DOI: 10.21437/interspeech.2021-2155

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

Abstract: Non-autoregressive (NAR) models have achieved large reductions in inference computation while delivering results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches for sequence-to-multi-sequence problems, such as multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one-by-one using both…
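The truncated abstract outlines the conditional-chain idea at a high level. Below is a minimal, hypothetical sketch (not the authors' released code) of how such non-autoregressive conditional-chain decoding could be organized: a shared encoder processes the mixture once, and each speaker's CTC hypothesis is emitted in parallel over time but conditioned on the representation produced for the previous speaker. The module names (`encoder`, `fusion`, `ctc_decoder`), the zero-initialized first condition, and the fixed speaker count are assumptions made for illustration.

```python
# Illustrative sketch of conditional-chain NAR multi-speaker ASR decoding.
# All submodules are hypothetical stand-ins, not the paper's implementation.
import torch
import torch.nn as nn


class ConditionalChainASR(nn.Module):
    def __init__(self, encoder: nn.Module, fusion: nn.Module,
                 ctc_decoder: nn.Module, max_speakers: int = 3):
        super().__init__()
        self.encoder = encoder          # e.g. a Conformer encoder over the mixture
        self.fusion = fusion            # combines mixture features with previous speaker's state
        self.ctc_decoder = ctc_decoder  # projects to vocabulary logits for CTC
        self.max_speakers = max_speakers

    @torch.no_grad()
    def decode(self, mixture_feats: torch.Tensor) -> list[torch.Tensor]:
        """Greedily decode each speaker in turn from the mixed input.

        mixture_feats: (batch, time, feat_dim) features of the overlapped speech.
        Returns a list of per-speaker greedy CTC token-ID sequences
        (blanks/repeats not yet collapsed).
        """
        enc = self.encoder(mixture_feats)        # (B, T, D), computed once and shared
        prev_state = torch.zeros_like(enc)       # condition for the first speaker
        hypotheses = []
        for _ in range(self.max_speakers):
            fused = self.fusion(enc, prev_state)  # condition on the previous speaker
            logits = self.ctc_decoder(fused)      # (B, T, vocab), emitted in parallel (NAR)
            hypotheses.append(logits.argmax(dim=-1))
            prev_state = fused                    # next speaker conditions on this pass
        return hypotheses
```

In this sketch the per-speaker loop is the only sequential component; within each pass the CTC head emits all time steps at once, which is what keeps the model non-autoregressive along the time axis.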

Cited by 12 publications (5 citation statements)
References 32 publications
“…Table 1 compares performance of the proposed CONF-TSASR model with contemporary results on the WSJ0-mix-extr datasets. For baselines, we include a conventional ASR model (Conformer-CTC), SpeakerBeam [10], Exformer [23] and Conditional-Conformer-CTC [6]. The first model was trained on single-speaker data, the second and third were trained on two speakers, and the last model was trained on up to three speakers.…”
Section: WSJ0-mix-extr Results
confidence: 99%
“…As the separation step of BSS is not optimized for ASR, this can be sub-optimal. Multi-speaker ASR approaches [3,4,5,6] and their speaker-attributed variants (SA-ASR) [7,8] generate transcripts as output and are optimized end-to-end for ASR. A characteristic of BSS models and their analogous multi-speaker ASR models is their multiple output branches, one per source.…”
Section: Introduction
confidence: 99%
“…The main challenges faced in these scenarios include overlapping speech, reverberation caused by distant microphones, and background noise. End-to-end multi-speaker ASR and diarization systems for single-channel [1][2][3][4] and multichannel recordings [5][6][7][8] have recently emerged, demonstrating promising results on meeting transcription tasks.…”
Section: Introduction
confidence: 99%
“…The earlier approaches for end-to-end multi-speaker ASR and diarization for single-channel recordings lacked information sharing between the ASR and the speaker verification modules [2] and/or required a varying number of speaker encoder (or attention) modules based on the number of speakers [1,3]. The authors of [4] proposed an end-to-end single-channel speaker-number-invariant Transformer-based speaker-attributed ASR (SA-ASR) system based on serialized output training (SOT).…”
Section: Introduction
confidence: 99%
“…By using a conditional chain, a mixed audio input with multiple speakers can be sequentially separated into individual outputs, where each output corresponds to a different speaker, with the previous output sequence serving as the conditional input. This approach can also be extended to multitalker ASR tasks to produce the transcriptions of different speakers directly [28,29]. The limitation of the conditional chain approach is that it is computationally expensive, as it requires multiple iterations to separate each speaker signal from the mixed audio signal.…”
Section: Introduction
confidence: 99%
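As the preceding excerpt notes, the conditional chain trades a fixed set of output branches for one pass per speaker, so inference cost grows linearly with the number of speakers. The following is a hedged sketch of that loop applied to separation-style outputs; the `separator` callable and the energy-based stopping rule are hypothetical stand-ins for illustration, not the cited papers' method.

```python
# Hedged sketch of conditional-chain source extraction: each pass conditions on
# the previously emitted source and peels off one more speaker from the mixture.
import torch


def chain_separate(mixture: torch.Tensor, separator, max_speakers: int = 4,
                   energy_threshold: float = 1e-3) -> list[torch.Tensor]:
    """Sequentially extract speakers from a single-channel mixture.

    mixture: (batch, samples) waveform of the overlapped speech.
    separator(mixture, condition) -> (batch, samples) estimate of the next source.
    """
    condition = torch.zeros_like(mixture)    # first speaker: no previous output
    sources = []
    for _ in range(max_speakers):
        estimate = separator(mixture, condition)
        # Illustrative stopping rule: an (almost) silent estimate is taken to
        # mean no speakers remain in the residual mixture.
        if estimate.pow(2).mean() < energy_threshold:
            break
        sources.append(estimate)
        condition = estimate                  # previous output conditions the next pass
    return sources
```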