Interspeech 2021
DOI: 10.21437/interspeech.2021-80
Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion

Abstract: It is now well established from a variety of studies that combining video and audio data yields a significant benefit in detecting active speakers. However, either modality can mislead audiovisual fusion by inducing unreliable or deceptive information. This paper formulates active speaker detection as a multi-objective learning problem that leverages the best of each modality using a novel self-attention, uncertainty-based multimodal fusion scheme. Results obtained show that the proposed mu…
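The abstract's idea of uncertainty-based fusion can be illustrated with a minimal sketch: each modality produces a score plus an estimate of its own predictive uncertainty, and the fused score down-weights the less reliable modality. This is a generic inverse-variance weighting illustration, not the authors' actual architecture; the function name and variance inputs are hypothetical.

```python
# Illustrative sketch of uncertainty-weighted audio-visual fusion.
# NOT the paper's method: the self-attention fusion scheme is replaced
# here by simple inverse-variance weighting for clarity.
import numpy as np

def fuse_by_uncertainty(audio_logits, video_logits, audio_var, video_var):
    """Combine per-modality speaker logits with inverse-variance weights.

    A modality reporting high predictive variance (low confidence)
    contributes less to the fused score, so an unreliable stream
    cannot dominate the decision.
    """
    w_audio = 1.0 / (audio_var + 1e-8)   # small epsilon avoids division by zero
    w_video = 1.0 / (video_var + 1e-8)
    return (w_audio * audio_logits + w_video * video_logits) / (w_audio + w_video)

# Example: the face is occluded, so video is uncertain and audio dominates.
fused = fuse_by_uncertainty(
    audio_logits=np.array([2.0]),    # confident "speaking"
    video_logits=np.array([-1.0]),   # misleading "not speaking"
    audio_var=np.array([0.1]),
    video_var=np.array([10.0]),
)
# fused stays close to the confident audio score (about 1.97 here).
```

In the paper's actual setting the fusion weights are learned jointly with the detectors; this sketch only shows why per-modality uncertainty estimates make the combination robust to a deceptive stream.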

Cited by 6 publications (2 citation statements)
References 17 publications
“…Recent research has focused on developing new techniques and models to improve ASD performance. Pouthier et al. [52] introduced a novel multi-modal fusion scheme based on self-attention and uncertainty to leverage audio and video modalities for ASD. Similarly, Kopuklu et al. [53] proposed a pipeline consisting of audio-visual encoding, inter-speaker modeling and temporal modeling stages, known as ASDNet, for detecting active speakers in challenging environments.…”
Section: Active Speaker Detection (ASD)
confidence: 99%
“…However, earlier efforts on ASD were limited to short sequences of frontal faces, which did not accurately reflect the complexities of real-world situations [8,9,10]. With the release of the AVA-ActiveSpeaker dataset [13], which is the first large-scale standard benchmark for ASD tasks, several high-performing large networks have been proposed to model various types of relational information in audiovisual modalities [11,12,14,15,32]. For example, the Unified Context Network (UniCon) [16] achieved superior performance by introducing a relational context module to intensively capture the spatial relationships between all speakers in a video frame.…”
Section: Introduction
confidence: 99%