2015
DOI: 10.1007/978-3-319-21963-9_50

Multimodal Speaker Diarization Utilizing Face Clustering Information

Abstract: Multimodal clustering/diarization tries to answer the question "who spoke when" by using audio and visual information. Diarization consists of two steps: first, segmenting the audio and detecting the speech segments; then, clustering those speech segments to group them by speaker. This task has mainly been studied on audiovisual data from meetings, news broadcasts, or talk shows. In this paper, we use visual information to aid speaker clustering. We tested the proposed method in …


Citations: cited by 1 publication (1 citation statement)
References: 10 publications
“…Other studies focused on the problem of speaker recognition without naming, using the speech modality as a single source of information. While some of these studies attempted to incorporate the visual modality, their goal was to cluster the speech segments rather than name the speakers (Erzin et al., 2005; Bost and Linares, 2014; Kapsouras et al., 2015; Bredin and Gelly, 2016; Hu et al., 2015; Ren et al., 2016). None of these studies used textual information (e.g., dialogue), which prevented them from identifying speaker names.…”
Citing paper: arXiv:1809.08761v1 [cs.CL], 24 Sep 2018
Section: Introduction (mentioning)
Confidence: 99%