End-to-End Neural Speaker Diarization with Self-Attention

This paper investigates the use of an end-to-end diarization model as post-processing for conventional clustering-based diarization. Clustering-based diarization methods partition frames into as many clusters as there are speakers; because each frame is assigned to exactly one speaker, they typically cannot handle overlapping speech. In contrast, some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. Although some of these methods can handle a flexible number of speakers, they do not perform well when the number of speakers is large. To compensate for the weaknesses of each approach, we propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method. We iteratively select two speakers from the results and update the results for those two speakers to improve the overlapped regions. Experimental results show that the proposed algorithm consistently improved the performance of state-of-the-art methods across the CALLHOME, AMI, and DIHARD II datasets.
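The iterative pairwise update described in this abstract can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' implementation: the function name `refine_pairs`, the frame-by-speaker activity matrix, and the `two_speaker_model` callable are all hypothetical, and the toy "oracle" model below merely stands in for a real two-speaker end-to-end diarization network, which would operate on acoustic features rather than frame indices.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# activity[t][s] == 1 means speaker s is active in frame t (hypothetical layout).
Activity = List[List[int]]
# Hypothetical stand-in for a two-speaker end-to-end model: given a speaker
# pair and the selected frame indices, return one (a, b) activity per frame.
PairModel = Callable[[Tuple[int, int], List[int]], List[Tuple[int, int]]]


def refine_pairs(activity: Activity, two_speaker_model: PairModel) -> Activity:
    """Update clustering-based results one speaker pair at a time."""
    num_speakers = len(activity[0]) if activity else 0
    refined = [row[:] for row in activity]  # do not mutate the input
    for i, j in combinations(range(num_speakers), 2):
        # Select the frames in which at least one speaker of the pair is active.
        frames = [t for t, row in enumerate(refined) if row[i] or row[j]]
        if not frames:
            continue
        # Overwrite only the two selected speakers; both may now be active
        # in the same frame, which is how overlap enters the result.
        for t, (a, b) in zip(frames, two_speaker_model((i, j), frames)):
            refined[t][i] = a
            refined[t][j] = b
    return refined


# Clustering output: exactly one speaker per frame, so no overlap.
clustering = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Toy ground truth in which speakers 0 and 1 overlap in frame 1.
truth = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]


def oracle(pair: Tuple[int, int], frames: List[int]) -> List[Tuple[int, int]]:
    """Toy model that simply reads the ground truth (assumption for the demo)."""
    return [(truth[t][pair[0]], truth[t][pair[1]]) for t in frames]


result = refine_pairs(clustering, oracle)
```

In this toy run, only frame 1 changes: the pair (0, 1) is updated so that both speakers are marked active there, while the other pairs leave the clustering result untouched.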


“…One end-to-end approach is called EEND [5,6,7]. They calculate multiple speaker activities, each corresponding to a single speaker.…”

Section: End-to-end Diarization For Overlapping Speech (mentioning)

confidence: 99%

End-To-End Speaker Diarization as Post-Processing

Horiguchi, García, Fujita, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite



“…We also use synthetic datasets (same as [25,26]) to evaluate RP-NSD's performance on highly overlapped speech. The simulated mixtures are made by placing two speakers' speech segments in a single audio file.…”

Section: Datasets (mentioning)

confidence: 99%

Speaker Diarization with Region Proposal Network

Huang, Watanabe, Fujita, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the "who spoke when" problem. Although standard diarization systems can achieve satisfactory results in various scenarios, they are composed of several independently optimized modules and cannot deal with overlapped speech. In this paper, we propose a novel speaker diarization method: Region Proposal Network based Speaker Diarization (RPNSD). In this method, a neural network generates overlapped speech segment proposals and computes their speaker embeddings at the same time. Compared with standard diarization systems, RPNSD has a shorter pipeline and can handle overlapped speech. Experimental results on three diarization datasets reveal that RPNSD achieves remarkable improvements over the state-of-the-art x-vector baseline.

Index Terms: speaker diarization, neural network, end-to-end, region proposal network, Faster R-CNN


“…Our dataset provides many opportunities to analyze and model multi-speaker affect in parent-child co-reading interactions. For example, an end-to-end neural speaker diarization [12] or multitask learning [51] can be integrated into our current affect prediction systems to jointly learn speaker diarization and individual speaker's valence and arousal. Further, the demographic and developmental profiles can be integrated into the current system to model multi-speaker affect in a personalized or culture-sensitive manner [37,39], as individual differences modulate affective expressions suggested by a theoretical foundation for affect detection [9] and cultural differences were empirically found to exist in some aspects of emotions, particularly emotional arousal level [25].…”

Section: Limitations and Future Work (mentioning)

confidence: 99%

Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child Multimodal Interaction Dataset

Chen, Zhang, Weninger, et al. 2020

Proceedings of the 2020 International Conference on Multimodal Interaction


Automatic speech-based affect recognition of individuals in dyadic conversation is a challenging task, in part because of its heavy reliance on manual pre-processing. Traditional approaches frequently require hand-crafted speech features and segmentation of speaker turns. In this work, we design end-to-end deep learning methods to recognize each person's affective expression in an audio stream with two speakers, automatically discovering features and time regions relevant to the target speaker's affect. We integrate a local attention mechanism into the end-to-end architecture and compare the performance of three attention implementations: one mean pooling and two weighted pooling methods. Our results show that the proposed weighted-pooling attention solutions are able to learn to focus on the regions containing the target speaker's affective information and successfully extract the individual's valence and arousal intensity. Here we introduce and use a "dyadic affect in multimodal interaction-parent to child" (DAMI-P2C) dataset collected in a study of 34 families, where a parent and a child (3-7 years old) engage in reading storybooks together. In contrast to existing public datasets for affect recognition, each instance for both speakers in the DAMI-P2C dataset is annotated for the perceived affect by three labelers. To encourage more research on the challenging task of multi-speaker affect sensing, we make the annotated DAMI-P2C dataset publicly available, including acoustic features of the dyads' raw audio, affect annotations, and a diverse set of developmental, social, and demographic profiles of each dyad.
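The contrast between mean pooling and weighted pooling mentioned in this abstract can be sketched as follows. This is a generic softmax-weighted attention pooling written under assumptions, not the paper's specific implementations: the function names and the `scoring_vector` parameter (standing in for a learned parameter) are hypothetical.

```python
import math
from typing import List

Frames = List[List[float]]  # frames[t] is the feature vector of frame t


def mean_pool(frames: Frames) -> List[float]:
    """Average all frames with equal weight."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over per-frame scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(frames: Frames, scoring_vector: List[float]) -> List[float]:
    """Weight each frame by softmax(scoring_vector . frame), then sum.

    scoring_vector is a hypothetical stand-in for a learned attention parameter.
    """
    scores = [sum(w * x for w, x in zip(scoring_vector, f)) for f in frames]
    weights = softmax(scores)
    return [sum(a * x for a, x in zip(weights, col)) for col in zip(*frames)]


frames = [[1.0, 2.0], [3.0, 4.0]]
uniform = weighted_pool(frames, [0.0, 0.0])  # all scores equal: same as mean
focused = weighted_pool([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])  # favors frame 0
```

With an all-zero scoring vector every frame receives the same weight, so weighted pooling reduces to mean pooling; a non-zero scoring vector lets the pooled representation concentrate on a subset of frames, which is the behavior the abstract attributes to the weighted variants.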


End-to-End Neural Speaker Diarization with Self-Attention

Cited by 170 publications

References 43 publications
