2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2017
DOI: 10.1109/icassp.2017.7953192

Speaker segmentation using deep speaker vectors for fast speaker change scenarios

Cited by 16 publications (10 citation statements)
References 7 publications
“…Most of the DNN-based SD systems introduced in Section 1 use a DNN to describe a speaker in a relatively short segment of conversation and then compare two representations of adjacent segments (e.g. so-called d-vectors [12]) to decide if the speaker change occurred. On the contrary, our approach using the CNN-based SCD finds the possible speaker changes in the spectrogram and additionally uses the information for the refinement of the accumulation process of statistics.…”
Section: Discussion (mentioning)
confidence: 99%
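
The adjacent-segment comparison described in this statement can be illustrated with a minimal sketch. It assumes each segment has already been mapped to a fixed-length embedding (for example, a d-vector) and that a change is hypothesized when the cosine distance between neighbouring embeddings exceeds a threshold. The function names, the toy two-dimensional embeddings, and the threshold of 0.5 are illustrative assumptions, not values taken from the cited papers.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def detect_changes(embeddings, threshold=0.5):
    """Return each index i where segments i-1 and i look like different speakers.

    The threshold is a hypothetical tuning parameter.
    """
    return [
        i
        for i in range(1, len(embeddings))
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold
    ]


# Toy embeddings: two segments pointing one way, then two pointing another.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
changes = detect_changes(segments)
```

In practice the threshold would be tuned on held-out data, and the distance measure itself is a design choice.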
“…The speaker change detection (SCD) is often applied to the audio signal to obtain segments which ideally contain a speech of a single speaker [2]. Commonly used approaches to the SCD include the Bayesian Information Criterion (BIC), Generalized Likelihood Ratio (GLR), Kullback-Leibler divergence [8,9], Support Vector Machine (SVM) [10] and Deep Neural Networks (DNNs) [11,12]. However, in a spontaneous telephone conversation containing very short speaker turns and frequent overlapping speech, diarization systems often omit the SCD process and use a simple constant length window segmentation of speech [3,5].…”
Section: Introduction (mentioning)
confidence: 99%
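
Of the classical criteria listed in this statement, the BIC test is the simplest to sketch. The example below is a simplified, one-dimensional illustration: it models each side of a candidate boundary with a single Gaussian and compares that against one Gaussian over both sides. Real systems apply the test to multivariate features such as MFCCs with full covariance matrices. The function names and the `penalty_weight` parameter are assumptions made for illustration.

```python
import math


def variance(xs):
    """Maximum-likelihood variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def delta_bic(left, right, penalty_weight=1.0):
    """Delta-BIC for splitting left + right at the boundary between them.

    A positive value favours two separate Gaussians, i.e. a change point.
    Each side must have non-zero variance.
    """
    both = left + right
    n, n1, n2 = len(both), len(left), len(right)
    gain = 0.5 * (
        n * math.log(variance(both))
        - n1 * math.log(variance(left))
        - n2 * math.log(variance(right))
    )
    # A one-dimensional Gaussian has 2 free parameters (mean and variance).
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


quiet = [0.0, 0.1, -0.1, 0.05, -0.05]
shifted = [5.0, 5.1, 4.9, 5.05, 4.95]
change_score = delta_bic(quiet, shifted)
same_score = delta_bic(quiet, [0.05, -0.05, 0.1, -0.1, 0.0])
```

With these toy values the first score is positive, which suggests a change, and the second is negative, which suggests no change.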
“…Alternatively, deep neural networks (DNNs) have also been successfully utilized to extract complex features [7,8]. Furthermore, d-vectors were presented in [9], yielding excellent results. The latest trend goes in the direction of deep speaker embeddings [10,11] designed for end-to-end systems.…”
Section: Related Work (mentioning)
confidence: 99%
“…After extracting and segmenting audio recording into frames with 25ms length, frames are processed by VAD method introduced in [24] to classify whether the frame is speech or not. We use speech segment with 0.1s window length proposed in [25] to split frames into homogeneous segments discriminately, and then feed these segments into deep recurrent convolutional neural network proposed in [17] to learn speaker embedding.…”
Section: System Architecture (mentioning)
confidence: 99%
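
The front end described in this statement (25 ms frames, a per-frame speech decision, then 0.1 s segments) can be sketched as follows. This is only one possible reading of the statement. The 16 kHz sample rate, the non-overlapping frames, the precomputed list of per-frame speech flags, and all function names are assumptions. The VAD and the embedding network from the cited work are not reproduced.

```python
def frame_bounds(num_samples, sample_rate=16000, frame_ms=25):
    """Return (start, end) sample indices of non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000
    return [
        (start, start + frame_len)
        for start in range(0, num_samples - frame_len + 1, frame_len)
    ]


def group_into_segments(frames, is_speech, frames_per_segment=4):
    """Group consecutive speech frames into fixed-length segments.

    With 25 ms frames, 4 frames make one 0.1 s segment. Runs of speech
    frames shorter than a full segment are dropped.
    """
    segments, run = [], []
    for frame, speech in zip(frames, is_speech):
        if not speech:
            run = []  # a non-speech frame breaks the current run
            continue
        run.append(frame)
        if len(run) == frames_per_segment:
            segments.append((run[0][0], run[-1][1]))
            run = []
    return segments


frames = frame_bounds(4000)  # 0.25 s of audio at the assumed 16 kHz
speech_flags = [True] * 4 + [False] + [True] * 5  # hypothetical VAD output
segments_01s = group_into_segments(frames, speech_flags)
```

Each resulting `(start, end)` pair marks the samples that would be passed to the embedding extractor.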