2022
DOI: 10.31219/osf.io/2pj8s
Preprint

ASR-aware end-to-end neural diarization

Abstract: We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to i…

citations
Cited by 3 publications
(4 citation statements)
references
References 18 publications
“…Later, the same authors upgraded the system by replacing the bidirectional long short-term memory (BLSTM) layers by self-attention modules [19]. Subsequent work has targeted EEND for unknown number of speakers [20], SD for long conversations [21], streaming EEND [22], SD constrained by turn detection (i.e., SCD) in [23], or even leveraging EEND for ASR [24].…”
Section: Related Work (mentioning)
confidence: 99%
“…In [32] a text-based SRD for multiparty dialogues is proposed, but limited to SRD. Finally, text-based diarization has been proposed in the past by [22,24]. However, these previous works do not take into account the text structure, grammar, and syntax.…”
Section: Related Work (mentioning)
confidence: 99%
“…Combining the auxiliary encoder representation with representations from the ASR prediction network allows the speaker branch to leverage lexical content for predicting speaker labels. Such a use of lexical information has been shown to be beneficial for speaker diarization using clustering-based [32,33] or end-to-end neural approaches [34].…”
Section: Auxiliary Speaker Transducer (mentioning)
confidence: 99%
“…A separately trained ASR system can then be used to transcribe each segment found by speaker diarisation, and obtain speaker-attributed ASR output over long audio streams [2,3]. Recently, end-to-end methods have been proposed for jointly modelling some modules in a speaker diarisation pipeline with an ASR system [4][5][6][7][8][9][10][11][12].…”
Section: Introduction (mentioning)
confidence: 99%