2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01248

Active Speakers in Context

Abstract: Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can be enough for addressing single-speaker scenarios, it prevents accurate detection when the task is to identify who of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and …

Cited by 47 publications (86 citation statements)
References 25 publications

“…Two competitors [6,17] achieved a higher mAP than the baseline provided by Roth et al. [16]. Both models depended on a lip-synchronisation preprocessing step, and could only achieve high performance when working with short-term time spans, usually in scenarios in which only one person was speaking [1,6,17].…”
Section: Related Work (mentioning)
confidence: 99%
“…To address the shortcomings of previous models, Alcázar et al. [1] propose Active Speakers in Context (ASC), a model whose main intuition is to leverage active speaker context from long-term inter-speaker relations. It differs from previous approaches in that it uses not only the target individual's face and the audio input, but also the faces of other individuals detected at the same timestamp [1]. Adding information about the context in which a speaking activity happens grants ASC a higher mAP than Zhang et al. [17], though still lower than that of the ensemble models of Chung [6].…”
Section: Related Work (mentioning)
confidence: 99%
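The mechanism described in the statement above (scoring a target speaker from their own audiovisual features plus the faces of other speakers visible at the same timestamps, over a long temporal window) lends itself to a compact sketch. The following minimal PyTorch model is illustrative only: the class name, feature dimensions, max-pooling over context speakers, and the GRU temporal model are all assumptions for exposition, not the authors' exact ASC architecture.

```python
# Minimal sketch of a context-aware active speaker scorer, assuming
# precomputed per-frame features. Not the authors' exact ASC model.
import torch
import torch.nn as nn


class ContextualActiveSpeakerScorer(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        # Pairwise relation between the target and one context face.
        self.pairwise = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
        )
        # Long-term temporal model over the per-frame fused features.
        self.temporal = nn.GRU(feat_dim + hidden_dim, hidden_dim,
                               batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)  # speaking / not speaking

    def forward(self, target_av, context_faces):
        # target_av:     (B, T, D)    fused audio+face features of the target
        # context_faces: (B, T, S, D) face features of S other speakers
        B, T, S, D = context_faces.shape
        # Pair the target with every context speaker at each timestep.
        target_rep = target_av.unsqueeze(2).expand(B, T, S, D)
        pairs = torch.cat([target_rep, context_faces], dim=-1)   # (B,T,S,2D)
        relations = self.pairwise(pairs)                         # (B,T,S,H)
        # Max-pool over context speakers: keep the strongest relation cue.
        context_summary = relations.max(dim=2).values            # (B,T,H)
        fused = torch.cat([target_av, context_summary], dim=-1)  # (B,T,D+H)
        hidden, _ = self.temporal(fused)                         # (B,T,H)
        return self.classifier(hidden).squeeze(-1)               # (B,T) logits


# Toy usage: batch of 2 clips, 50 frames, 3 context speakers per frame.
model = ContextualActiveSpeakerScorer()
logits = model(torch.randn(2, 50, 128), torch.randn(2, 50, 3, 128))
print(logits.shape)  # torch.Size([2, 50])
```

The key design point this sketch captures is the one the citation statement highlights: the per-frame decision for the target speaker is conditioned on the other candidate speakers visible at the same timestamp, and those fused cues are then aggregated over a long time horizon rather than a short clip.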