Interspeech 2019
DOI: 10.21437/interspeech.2019-3116

Who Said That?: Audio-Visual Speaker Diarisation of Real-World Meetings

Abstract: The goal of this work is to determine 'who spoke when' in real-world meetings. The method takes surround-view video and single- or multi-channel audio as inputs, and generates robust diarisation outputs. To achieve this, we propose a novel iterative approach that first enrols speaker models using audio-visual correspondence, then uses the enrolled models together with the visual information to determine the active speaker. We show strong quantitative and qualitative performance on a dataset of real-world meetings…
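To make the enrol-then-assign idea in the abstract concrete, here is a minimal sketch of such an iterative scheme. The helpers av_sync_score and speaker_embedding, the 0.5 threshold, and the toy segments are illustrative stand-ins, not the paper's actual models or settings.

```python
# Minimal sketch of an enrol-then-assign audio-visual diarisation loop.
# The feature extractors below are random stand-ins, not the paper's models.
import numpy as np

rng = np.random.default_rng(0)

def av_sync_score(audio_segment, face_track):
    """Stand-in audio-visual correspondence score (e.g. lip-sync confidence)."""
    return float(rng.random())

def speaker_embedding(audio_segment):
    """Stand-in speaker embedding (e.g. a d-vector / x-vector)."""
    return rng.standard_normal(128)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

segments = [f"seg{i:02d}" for i in range(20)]          # toy speech segments
face_tracks = {"spk_A": "trackA", "spk_B": "trackB"}   # toy per-speaker face tracks

# Step 1: enrol a speaker model from segments with confident AV correspondence.
enrolled = {}
for spk, track in face_tracks.items():
    matched = [speaker_embedding(seg) for seg in segments
               if av_sync_score(seg, track) > 0.5]
    if matched:
        enrolled[spk] = np.mean(matched, axis=0)

# Step 2: assign every segment to the closest enrolled speaker model.
diarisation = {}
for seg in segments:
    emb = speaker_embedding(seg)
    diarisation[seg] = max(enrolled, key=lambda spk: cosine(emb, enrolled[spk]))

print(diarisation)
```

In a real system the correspondence score would come from a lip-sync model and the embeddings from a trained speaker encoder; the loop structure is the part being illustrated.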

Cited by 33 publications (29 citation statements)
References: 30 publications
“…Therefore, several methods that leverage audio and visual cues for diarization are motivated by the synergy between utterances and lip movements. These methods adopt techniques such as mutual information [26,47], canonical correlation analysis [28], and deep learning [10,12,13]. In recent works, audio-visual correspondence is also used for associating talking faces and voice tracks [9,60,67].…”
Section: Related Work (mentioning)
confidence: 99%
“…Thus, completely off-screen speakers deserve further investigation. While early work in speaker diarization solely depends on the audio stream [54,64,71], recent works [9,10,22] start to attempt leveraging both audio and visual cues. These methods either fuse the similarity scores of two independent modalities or incorporate the synchronization between utterance and lip motion.…”
Section: Introduction (mentioning)
confidence: 99%
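The first strategy mentioned in the statement above, fusing the similarity scores of two independent modalities, can be sketched generically as a weighted sum. The weights and example numbers below are assumptions for illustration, not the formulation of any of the cited systems.

```python
# Generic late fusion of modality-specific similarity scores (illustrative
# weights and values; not any cited system's exact formulation).
import numpy as np

def fuse_scores(audio_sim, visual_sim, alpha=0.6):
    """Weighted sum of per-speaker scores from the audio and visual streams."""
    return alpha * audio_sim + (1.0 - alpha) * visual_sim

# Similarity of one speech segment to two candidate speakers, per modality.
audio_sim = np.array([0.82, 0.35])    # e.g. cosine similarity of speaker embeddings
visual_sim = np.array([0.40, 0.75])   # e.g. lip-motion / active-speaker confidence

fused = fuse_scores(audio_sim, visual_sim)
print("assigned speaker index:", int(np.argmax(fused)))
```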
“…In addition to text, vision can be combined with speech as well. Such tasks include audio-visual speech recognition [50]-[52], speaker recognition [53]-[55], as well as speech diarisation [56], [57], separation [58], [59] and enhancement [60], which mostly focused on the use of visual features to improve the robustness of the audio-only methods.…”
Section: Introduction (mentioning)
confidence: 99%
“…Previously, the frame-level portion of these extractors was based on TDNN (time delay neural network) blocks that contained only five convolutional layers with temporal context [6]. Such types of embeddings, referred to as x-vectors, are often applied to diarization tasks in state-of-the-art systems [12]. The newer ECAPA-TDNN (emphasized channel attention, propagation and aggregation TDNN) [13] architecture develops the idea of TDNN and contains additional blocks with hierarchical filters for the extraction of different scale features.…”
Section: Introduction (mentioning)
confidence: 99%
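As a rough illustration of the frame-level TDNN layers described in the statement above, the PyTorch sketch below implements a single dilated 1-D convolution block over acoustic frames. The dimensions and layer composition are assumptions for illustration and do not reproduce the exact x-vector or ECAPA-TDNN recipes.

```python
# Sketch of one frame-level TDNN block: a 1-D convolution over time whose
# kernel size and dilation set the temporal context (illustrative dimensions).
import torch
import torch.nn as nn

class TDNNBlock(nn.Module):
    def __init__(self, in_dim, out_dim, context=5, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context,
                              dilation=dilation, padding="same")
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x):          # x: (batch, features, frames)
        return self.norm(self.act(self.conv(x)))

# Example: 80-dim filterbank frames -> 512-dim frame-level representations.
feats = torch.randn(4, 80, 300)             # (batch, mels, frames)
frame_layer = TDNNBlock(80, 512, context=5, dilation=1)
print(frame_layer(feats).shape)             # torch.Size([4, 512, 300])
```

Stacking several such blocks with increasing dilation widens the receptive field over frames, which is the "temporal context" the quoted description refers to; ECAPA-TDNN additionally adds channel attention and multi-scale aggregation on top of this idea.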