ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9413544

A Real-Time Speaker Diarization System Based on Spatial Spectrum

Abstract: In this paper we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting. We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks: (1) to segment and separate overlapping speech from two speakers; (2) to estimate the number of speakers when participants may enter or leave the conversation at any time; (3) to provide accurate speaker identification on short text-independent utter…

Cited by: 18 publications (5 citation statements)
References: 17 publications

“…The DOA of the sound source is proved to be helpful [8]. We train a neural-net-based DOA estimator to obtain a 36-dim probability vector representing the azimuth angles that divide the space with ten-degree intervals.…”
Section: Front-end Processing (mentioning)
confidence: 99%
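
The 36-dimensional output described in the statement above can be read as one probability per ten-degree azimuth sector. The following is a minimal sketch of that mapping only, not the citing authors' estimator. The function names are hypothetical, and the alignment of sector k to the range [10k, 10k + 10) degrees is an assumption, since the quoted text does not say where the sectors start.

```python
NUM_BINS = 36         # from the quoted statement: 36-dim probability vector
BIN_WIDTH_DEG = 10.0  # from the quoted statement: ten-degree intervals


def azimuth_bin(angle_deg):
    """Return the sector index for an azimuth in degrees.

    Assumption: sector k covers [10k, 10k + 10) degrees.
    Angles outside [0, 360) are wrapped first.
    """
    return int((angle_deg % 360.0) // BIN_WIDTH_DEG)


def most_likely_sector(probs):
    """Return (index, (start_deg, end_deg)) of the highest-probability sector."""
    if len(probs) != NUM_BINS:
        raise ValueError(f"expected {NUM_BINS} probabilities, got {len(probs)}")
    k = max(range(NUM_BINS), key=lambda i: probs[i])
    return k, (k * BIN_WIDTH_DEG, (k + 1) * BIN_WIDTH_DEG)
```

For example, a vector with all of its mass on index 4 would map to the sector from 40 to 50 degrees under the alignment assumed here.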
“…We release baseline systems along with the Train and Eval data for quick start and reproducible research. For the 8-channel data of AliMeeting recorded by microphone array, we select the first channel to obtain Ali-far, and adopt CDDMA beamformer [41,42] on 8-channel data to generate Ali-far-bf. We use prefix Train-*, Eval-* and Test-* to denote generated data associated with Train, Eval and Test sets.…”
Section: Datasets Tracks and Baselines (mentioning)
confidence: 99%
“…The AliMeeting corpus contains far-field overlapped audios (Ali-far), as well as the corresponding near-field audios (Ali-near), which only record and transcribe the speech of a single speaker. The CDDMA Beamformer [34,35] is applied to Ali-far to produce Ali-far-bf. To evaluate the performance in a single talker scenario, Test_Net, and Test_Meeting are adopted.…”
Section: Dataset (mentioning)
confidence: 99%