A Combined Feature Approach for Speaker Segmentation Using Convolution Neural Network

Jiang, Zhong Ping; Zhang, Pan; Li, Xue

doi:10.1007/978-3-319-77383-4_54

Cited by 2 publications

(1 citation statement)

References 17 publications

“…Methods that use long-term conversational features [13] have also been put forward along with multimodal techniques that use multiple microphone and camera systems [14]. More recently, deep learning approaches have become increasingly prevalent [15]-[18] which often require large amounts of labelled training data.…”

Section: Introduction (mentioning)

confidence: 99%

Overlapping Speaker Segmentation Using Multiple Hypothesis Tracking of Fundamental Frequency

Hogg, Evers, Moore et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker's utterance in the presence of an additional active speaker. This system is benchmarked against a segmentation system from the literature that employs a bidirectional long short-term memory (BLSTM) network and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as input features to the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity.
