Multimodal Speaker Diarization Utilizing Face Clustering Information

Kapsouras, Ioannis; Tefas, Anastasios; Nikolaidis, Nikos; Pitas, Ioannis

doi:10.1007/978-3-319-21963-9_50

Cited by 1 publication

(1 citation statement)

References 10 publications

“…Other studies focused on the problem of speaker recognition without naming, using the speech modality as a single source of information. While some of these studies attempted to incorporate the visual modality, their goal was to cluster the speech segments rather than name the speakers (Erzin et al., 2005; Bost and Linares, 2014; Kapsouras et al., 2015; Bredin and Gelly, 2016; Hu et al., 2015; Ren et al., 2016). None of these studies used textual information (e.g., dialogue), which prevented them from identifying speaker names.…”

Section: Introduction (mentioning)

confidence: 99%

Speaker Naming in Movies

Azab, Wang, Smith et al. 2018

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu

We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms several competitive baselines on the average weighted F-score metric. To demonstrate the effectiveness of our framework, we design an end-to-end memory network model that leverages our speaker naming model and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.
