2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
DOI: 10.1109/asru46091.2019.9003959

End-to-End Neural Speaker Diarization with Self-Attention

Abstract: Speaker diarization has mainly been developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems: (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In …
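The abstract describes EEND as a network that directly outputs per-speaker activities for a multi-talker recording; because the output speaker slots have no fixed order, training uses a permutation-invariant objective. A minimal numpy sketch of that permutation-invariant binary cross-entropy, assuming frame-wise activity probabilities (function name and toy values are illustrative, not from the paper):

```python
import itertools
import numpy as np

def pit_bce_loss(probs, labels):
    """Permutation-invariant BCE for diarization.

    probs:  (T, S) frame-wise speech-activity probabilities per speaker slot
    labels: (T, S) 0/1 reference activities (rows may sum > 1 for overlaps)
    Returns the BCE under the best speaker-slot permutation.
    """
    eps = 1e-8
    _, S = probs.shape
    best = np.inf
    for perm in itertools.permutations(range(S)):
        p = probs[:, perm]  # try this assignment of slots to speakers
        bce = -np.mean(labels * np.log(p + eps)
                       + (1 - labels) * np.log(1 - p + eps))
        best = min(best, bce)
    return best

# Toy example: the prediction matches the labels with columns swapped,
# so the permutation search finds a much lower loss than the identity order.
labels = np.array([[1.0, 0.0], [1.0, 1.0]])
probs  = np.array([[0.1, 0.9], [0.8, 0.9]])
loss_pit = pit_bce_loss(probs, labels)
loss_id  = -np.mean(labels * np.log(probs + 1e-8)
                    + (1 - labels) * np.log(1 - probs + 1e-8))
assert loss_pit < loss_id
```

Note that the exhaustive permutation search is factorial in the number of speakers, which is tolerable for the two- to four-speaker settings EEND targets.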

Cited by 170 publications (196 citation statements) · References 43 publications
“…One end-to-end approach is called EEND [5,6,7]. They calculate multiple speaker activities, each corresponding to a single speaker.…”
Section: End-to-end Diarization For Overlapping Speechmentioning
confidence: 99%
“…We also use synthetic datasets (same as [25,26]) to evaluate RP-NSD's performance on highly overlapped speech. The simulated mixtures are made by placing two speakers' speech segments in a single audio file.…”
Section: Datasetsmentioning
confidence: 99%
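The citation statement above describes simulated training mixtures made by placing two speakers' speech segments into a single audio file, which naturally produces overlapped regions. A minimal sketch of that construction, assuming single-channel waveforms as numpy arrays (function name, offsets, and segment lengths are illustrative):

```python
import numpy as np

def simulate_mixture(seg_a, seg_b, total_len, offset_a, offset_b):
    """Place two speakers' segments in one audio buffer (segments may overlap)."""
    mix = np.zeros(total_len, dtype=np.float32)
    mix[offset_a:offset_a + len(seg_a)] += seg_a
    mix[offset_b:offset_b + len(seg_b)] += seg_b
    # Frame-level activity labels usable as diarization training targets.
    labels = np.zeros((total_len, 2), dtype=np.int8)
    labels[offset_a:offset_a + len(seg_a), 0] = 1
    labels[offset_b:offset_b + len(seg_b), 1] = 1
    return mix, labels

rng = np.random.default_rng(0)
a = rng.standard_normal(400).astype(np.float32)  # stand-in for speaker A's segment
b = rng.standard_normal(400).astype(np.float32)  # stand-in for speaker B's segment
mix, labels = simulate_mixture(a, b, total_len=1000, offset_a=100, offset_b=300)
overlap = np.mean(labels.sum(axis=1) == 2)  # fraction of overlapped frames
assert overlap > 0
```

Choosing the offsets (or drawing silence gaps from a distribution) controls the overlap ratio of the resulting corpus, which is why such simulated mixtures are useful for stress-testing overlap handling.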
“…Our dataset provides many opportunities to analyze and model multi-speaker affect in parent-child co-reading interactions. For example, an end-to-end neural speaker diarization [12] or multitask learning [51] can be integrated into our current affect prediction systems to jointly learn speaker diarization and each speaker's valence and arousal. Further, demographic and developmental profiles can be integrated into the current system to model multi-speaker affect in a personalized or culture-sensitive manner [37,39], as individual differences modulate affective expressions, as suggested by a theoretical foundation for affect detection [9], and cultural differences were empirically found in some aspects of emotion, particularly emotional arousal level [25].…”
Section: Limitations and Future Workmentioning
confidence: 99%