2022
DOI: 10.1007/978-3-031-19833-5_22

Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

Abstract: We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes representing the same identity share edges between them within a defined temporal window. Nodes within the same video…
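
To make the graph construction concrete, here is a minimal sketch of the same-identity temporal edges described above. It is not the authors' code: the (node_id, identity, timestamp) tuple layout, the window size tau, and the function name are all assumptions for illustration.

from itertools import combinations

def build_speaker_graph(detections, tau=0.9):
    # detections: list of (node_id, identity, timestamp) tuples, one per face.
    # Connect nodes that share an identity and lie within tau seconds of each other.
    edges = []
    for (i, id_i, t_i), (j, id_j, t_j) in combinations(detections, 2):
        if id_i == id_j and abs(t_i - t_j) <= tau:
            edges.append((i, j))
    return edges

# Example: identity "A" appears at t=0.0, 0.4 and 2.0; identity "B" at t=0.0.
dets = [(0, "A", 0.0), (1, "B", 0.0), (2, "A", 0.4), (3, "A", 2.0)]
print(build_speaker_graph(dets))  # [(0, 2)]; node 3 falls outside the window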

Cited by 20 publications (22 citation statements)
References 34 publications

“…• Extensive experiments on AVA-ActiveSpeaker [32], a benchmark dataset for active speaker detection released by Google, show that our method is comparable to the state-of-the-art method [22] while still reducing model parameters by 95.6% and FLOPs by 76.9%.…”
Section: Introduction
confidence: 83%
“…Figure 1 visualizes multiple metrics of different active speaker detection approaches. The experimental results show that our active speaker detection method (1.0M params, 0.6G FLOPs, 94.1% mAP) significantly reduces the model size and computational cost, and its performance is still comparable to the state-of-the-art method [22] (22.5M params, 2.6G FLOPs, 94.2% mAP) on the benchmark. Moreover, our method shows good robustness in cross-dataset testing.…”
Section: Introduction
confidence: 93%
“…MAAS-TAN [71] proposes a different multimodal graph approach. Following MAAS-TAN, SPELL [73] presents an efficient graph-based framework that achieved superior performance. It builds a multimodal graph from the audio-visual data and casts active speaker detection as a graph node classification task.…”
Section: Related Work
confidence: 99%
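
As a rough illustration of the node classification view referenced above, the following sketch assumes fused audio-visual node features of dimension 512, two classes (speaking / not speaking), and PyTorch Geometric's GCNConv standing in for SPELL's actual layers; none of these choices come from the paper itself.

import torch
from torch_geometric.nn import GCNConv

class NodeClassifier(torch.nn.Module):
    # Two-layer graph network emitting per-node speaking / not-speaking logits.
    def __init__(self, in_dim=512, hid_dim=64, num_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, num_classes)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

x = torch.randn(4, 512)                      # 4 face nodes with fused features
edge_index = torch.tensor([[0, 2], [2, 0]])  # one undirected edge, both directions
logits = NodeClassifier()(x, edge_index)     # shape: [4, 2]
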
“…MAAS-TAN [71] employed a graph neural network approach. SPELL [73] proposed learning a graph-based representation that can significantly improve active speaker detection performance owing to its explicit spatial and temporal structure. Unicon [74] proposed a unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among the candidates and contrast their audiovisual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties.…”
Section: Comparison With State-of-the-art
confidence: 99%