2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)
DOI: 10.1109/iscslp57327.2022.10037846
TSUP Speaker Diarization System for Conversational Short-phrase Speaker Diarization Challenge

Cited by 5 publications (4 citation statements) | References 33 publications
“…The system developed outputs the ensemble results of four modules: a self-attention-based VAD, uniform segmentation, an ECAPA-TDNN-based embedding extractor, and spectral clustering. This is the same challenge as in paper [12] above. CSSD stands for the conversational short-phrase speaker diarization dataset and has three features.…”
Section: Fig. 2 (mentioning)
confidence: 87%
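As an aside on the pipeline this statement describes, the snippet below is a minimal sketch of a clustering-based diarization back-end: uniform segmentation over VAD speech regions, per-segment embeddings, and spectral clustering on a cosine-affinity matrix. The VAD output (`speech_regions`) and the embedding extractor (`embed_fn`) are placeholders standing in for the self-attention VAD and ECAPA-TDNN model; this is an illustrative sketch under those assumptions, not the TSUP system's actual implementation.

```python
# Sketch of a clustering-based diarization back-end (not the TSUP code).
import numpy as np
from sklearn.cluster import SpectralClustering

def uniform_segments(start, end, win=1.5, hop=0.75):
    """Cut a VAD speech region [start, end) into fixed-length overlapping segments."""
    t = start
    while t + win <= end:
        yield (t, t + win)
        t += hop

def diarize(speech_regions, embed_fn, n_speakers):
    """speech_regions: (start, end) pairs from a VAD; embed_fn stands in for an
    ECAPA-TDNN-style segment embedding extractor (hypothetical placeholder)."""
    segs = [s for region in speech_regions for s in uniform_segments(*region)]
    emb = np.stack([embed_fn(s) for s in segs])               # (n_segs, dim)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)    # unit-normalise
    affinity = np.clip(emb @ emb.T, 0.0, 1.0)                 # cosine affinity
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity="precomputed").fit_predict(affinity)
    return list(zip(segs, labels))                            # (segment, speaker id)

# Toy usage: random vectors stand in for real speaker embeddings.
rng = np.random.default_rng(0)
print(diarize([(0.0, 6.0), (7.0, 12.0)],
              embed_fn=lambda seg: rng.normal(size=192),
              n_speakers=2)[:3])
```

In a real system the number of speakers would typically be estimated from the affinity spectrum rather than passed in, and overlapping segment labels would be merged back into RTTM segments.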
“…From Table 2 it is evident that CDER falls with increasing duration. Paper [12] also shows that SC, TS-VAD, and EEND were explored, and that DOVER-LAP was then applied to fuse the RTTM outputs inferred from these three systems to obtain the final result. Paper [12] reports that spectral-clustering-based speaker diarization performs best on the CDER metric, with 12.0% and 9.5% on the dev set and test set respectively, and is hence competitive under the new CDER metric.…”
Section: Fig. 2 (mentioning)
confidence: 99%
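To illustrate the fusion step referenced here, the sketch below fuses several diarization hypotheses by per-frame plurality voting. It is a simplified stand-in, not DOVER-LAP itself: DOVER-LAP additionally solves a speaker-label mapping across systems and uses rank-weighted voting, whereas this sketch assumes the hypotheses already share a common label space. All segment data and function names are made up for the example.

```python
# Simplified frame-level plurality voting over diarization hypotheses
# (an illustration of hypothesis fusion, not the DOVER-LAP algorithm).
from collections import Counter

def rttm_to_frames(segments, total, step=0.01):
    """segments: (start, duration, speaker) tuples for one system.
    Returns one speaker label (or None) per 10 ms frame."""
    n = int(round(total / step))
    frames = [None] * n
    for start, dur, spk in segments:
        lo = int(round(start / step))
        hi = min(n, int(round((start + dur) / step)))
        for i in range(lo, hi):
            frames[i] = spk
    return frames

def vote(hypotheses, total, step=0.01):
    """Fuse several systems' frame label streams by plurality vote."""
    streams = [rttm_to_frames(h, total, step) for h in hypotheses]
    fused = []
    for labels in zip(*streams):
        labels = [l for l in labels if l is not None]
        fused.append(Counter(labels).most_common(1)[0][0] if labels else None)
    return fused

# Three toy hypotheses that disagree only around the 2 s speaker change.
sys_a = [(0.0, 2.0, "spk1"), (2.0, 2.0, "spk2")]
sys_b = [(0.0, 2.5, "spk1"), (2.5, 1.5, "spk2")]
sys_c = [(0.0, 1.8, "spk1"), (1.8, 2.2, "spk2")]
print(vote([sys_a, sys_b, sys_c], total=4.0)[195:205])  # labels around the boundary
```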
“…Using visual information as a complementary modality to improve diarization systems has become a promising direction. Existing works mainly depend on constructing cross-modal synergy [25]-[28], clustering on audio-visual pairs [29], [30], or end-to-end audio-visual diarization [31], [32], which are basically derived from the previous audio-only methods. Motivated by the highly accurate performance in TS-VAD studies, a question arises as to whether it is feasible to investigate this framework in an audio-visual manner.…”
Section: Introduction (mentioning)
confidence: 99%