2023
DOI: 10.48550/arxiv.2301.08237
Preprint

LoCoNet: Long-Short Context Network for Active Speaker Detection

Abstract: Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons from audio and visual information from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. These two contexts are complementary to each other and can help infer the active speaker. Motivated by these o…

Cited by 2 publications
(2 citation statements)
References 24 publications
“…2. Without fine-tuning, our method achieves a state-of-the-art average F1 score of 81.1% on the Columbia dataset compared with TalkNet [36] and LoCoNet [42], showing good robustness.…”
Section: Comparison With the State-of-the-art
Confidence: 97%
“…Due to the massive-scale and unconstrained nature of Ego4D, it has proved to be useful for various tasks including action recognition (Liu et al., 2022a; Lange et al., 2023), action detection (Wang et al., 2023a), visual question answering (Bärmann & Waibel, 2022), active speaker detection (Wang et al., 2023d), natural language localisation, natural language queries (Ramakrishnan et al., 2023), gaze estimation (Lai et al., 2022), persuasion modelling for conversational agents (Lai et al., 2023b), audio-visual object localisation (Huang et al., 2023a), hand-object segmentation (Zhang et al., 2022b) and action anticipation (Ragusa et al., 2023a; Pasca et al., 2023; Mascaró et al., 2023). New tasks have also been introduced thanks to the diversity of Ego4D, e.g.…”
Section: General Datasetsmentioning
Confidence: 99%