Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

Min, Kyle; Roy, Sourya; Tripathi, Subarna; Guha, Tanaya; Majumdar, Somdeb

doi:10.1007/978-3-031-19833-5_22

Cited by 20 publications

(22 citation statements)

References 34 publications


“…• Extensive experiments on AVA-ActiveSpeaker [32], a benchmark dataset for active speaker detection released by Google, show that our method is comparable to the state-of-the-art method [22] while still reducing model parameters by 95.6% and FLOPs by 76.9%.…”

Section: Introduction (mentioning)

confidence: 83%

“…Figure 1 visualizes multiple metrics of different active speaker detection approaches. The experimental results show that our active speaker detection method (1.0M params, 0.6G FLOPs, 94.1% mAP) significantly reduces the model size and computational cost, and its performance is still comparable to the state-of-the-art method [22] (22.5M params, 2.6G FLOPs, 94.2% mAP) on the benchmark. Moreover, our method shows good robustness in cross-dataset testing.…”

Section: Introduction (mentioning)

confidence: 93%

“…mAP vs. FLOPs, size ∝ parameters. The mAP of different active speaker detection methods [1,2,18,22,36,44] on the benchmark and the FLOPs required to predict one frame containing three candidates. The size of the blobs is proportional to the number of model parameters.…”

Section: Introduction (mentioning)

confidence: 99%


A Light Weight Model for Active Speaker Detection

Liao, Duan, Kanghui, et al. 2023

Preprint


Active speaker detection is a challenging task in audiovisual scenario understanding, which aims to detect who is speaking in scenarios with one or more speakers. This task has received extensive attention as it is crucial in applications such as speaker diarization, speaker tracking, and automatic video editing. Existing studies try to improve performance by inputting information from multiple candidates and designing complex models. Although these methods achieve outstanding performance, their high consumption of memory and computational power makes them difficult to apply in resource-limited scenarios. Therefore, we construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying a gated recurrent unit (GRU) with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%), while the resource costs are significantly lower than those of the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23×) and FLOPs (0.6G vs. 2.6G, about 4×). In addition, our framework also performs well on the Columbia dataset, showing good robustness. The code and model weights are available at https://github.com/Junhua-Liao/Light-ASD.
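The efficiency figures quoted for this paper (a 95.6% parameter reduction and a 76.9% FLOPs reduction in the citation statement, "about 23×" and "about 4×" in the abstract) can be cross-checked from the raw numbers. The snippet below is only an arithmetic sketch; the dictionary keys and helper names are illustrative and do not come from the paper or its code.

```python
# Figures as quoted in the abstract: parameters in millions, FLOPs in GFLOPs, mAP in percent.
light_asd = {"params_m": 1.0, "gflops": 0.6, "map_pct": 94.1}
baseline = {"params_m": 22.5, "gflops": 2.6, "map_pct": 94.2}


def reduction_pct(small: float, large: float) -> float:
    """Percentage by which `small` is lower than `large`."""
    return 100.0 * (1.0 - small / large)


def ratio(small: float, large: float) -> float:
    """How many times larger `large` is than `small`."""
    return large / small


params_reduction = reduction_pct(light_asd["params_m"], baseline["params_m"])
flops_reduction = reduction_pct(light_asd["gflops"], baseline["gflops"])
params_ratio = ratio(light_asd["params_m"], baseline["params_m"])
flops_ratio = ratio(light_asd["gflops"], baseline["gflops"])
map_gap = baseline["map_pct"] - light_asd["map_pct"]

print(f"parameter reduction: {params_reduction:.1f}% ({params_ratio:.1f}x fewer)")
print(f"FLOPs reduction:     {flops_reduction:.1f}% ({flops_ratio:.1f}x fewer)")
print(f"mAP gap:             {map_gap:.1f} percentage points")
```

The computed ratios are 22.5× and roughly 4.3×, which the abstract rounds to "about 23×" and "about 4×".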


“…MAAS-TAN [71] proposes a different multimodal graph approach. Following MAAS-TAN, SPELL [73] presents a model that achieved superior performance by proposing an efficient graph-based framework. It is a multimodal graph from the audiovisual data and casts the active speaker detection as a graph node classification task.…”

Section: Related Work (mentioning)

confidence: 99%

“…MAA-TAN [71] employed graph neural networks approach. SPELL [73] proposed a learning graph-based representation that can significantly improve the active speaker detection performance owing to its explicit spatial and temporal structure. Unicon [74] proposed a unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among the candidates and contrast audiovisual affinities with each other, and temporal context to aggregate long term information and smooth out local uncertainties.…”

Section: Comparison With State-of-the-art (mentioning)

confidence: 99%

Efficient Audiovisual Fusion for Active Speaker Detection

Tesema, Song, et al. 2023

IEEE Access


Active speaker detection (ASD) refers to detecting the speaking person among visible human instances in a video. Existing methods widely employ a similar audiovisual fusion approach, concatenation. Although such a fusion approach is often argued to help enhance performance, it must be noted that the feature modalities do not play an equal role. It forces the backend network to focus on learning intramodal rather than intermodal features. Another concern is that since concatenation doubles the fused feature dimension that feeds from the audio and video modules, it creates a higher computational overhead for the backend network. To address these problems, this work hypothesizes that instead of leveraging a deterministic fusion operation, employing an efficient fusion technique may assist the network in learning efficiently and improve detection accuracy. This work proposes an efficient audiovisual fusion (AVF) with fewer feature dimensions that captures the correlations between facial regions and sound signals, focusing more on the discriminative facial features and associating them with the corresponding audio features. Furthermore, previous ASD works focus only on improving ASD performance by creating a large computational overhead using complex techniques such as adding sophisticated postprocessing, applying smoothing techniques on the classifier to refine the network outputs at multiple stages, or assembling multiple network outputs. This work proposes a simple yet effective end-to-end ASD using the newly proposed feature fusion approach, the AVF. The proposed framework attained a mAP of 84.384% on the validation set of the most challenging audiovisual speaker detection benchmark, AVA-ActiveSpeaker. With this, this work outperformed previous works that did not apply postprocessing and attained competitive detection accuracy compared to other works that employed different postprocessing tasks. The proposed model also learns better on the unsynchronized raw AVA-ActiveSpeaker dataset. The ablation experiments under different image scale settings and noisy signals show the AVF's greater effectiveness and robustness compared with the concatenation operation.
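The abstract's point that concatenation doubles the fused feature dimension can be illustrated with a toy example. This is a sketch under stated assumptions: both embeddings are assumed to have the same length, the function names are hypothetical, and the element-wise product is only a generic stand-in for a dimension-preserving fusion, not the AVF module the paper proposes.

```python
from typing import List


def concat_fusion(audio: List[float], visual: List[float]) -> List[float]:
    """Concatenation: the fused vector has len(audio) + len(visual) entries."""
    return audio + visual


def elementwise_fusion(audio: List[float], visual: List[float]) -> List[float]:
    """Stand-in fusion (NOT the paper's AVF): keeps the original dimension."""
    if len(audio) != len(visual):
        raise ValueError("element-wise fusion assumes equal-length embeddings")
    return [a * v for a, v in zip(audio, visual)]


# Toy 4-dimensional embeddings; the values are arbitrary.
audio_embedding = [0.5, 1.0, -1.0, 2.0]
visual_embedding = [2.0, 0.5, 3.0, -1.0]

concatenated = concat_fusion(audio_embedding, visual_embedding)
fused = elementwise_fusion(audio_embedding, visual_embedding)

print(len(audio_embedding), len(concatenated), len(fused))
```

With equal-length inputs, the concatenated vector is twice as long as either embedding, so any backend layer consuming it needs correspondingly more input weights, while the stand-in fusion leaves the dimension unchanged.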


Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding

Tran, Kim, et al. 2024

Lecture Notes in Computer Science

