2022
DOI: 10.48550/arxiv.2203.14250
Preprint

End-to-End Active Speaker Detection

Cited by 2 publications (19 citation statements)
References 0 publications
“…Our extensive experiments demonstrate the effectiveness of our approach. On the AVA-ActiveSpeaker dataset [36], LoCoNet achieves an mAP of 95.2%, outperforming the current state-of-the-art method EASEE [4] by 1.1% despite using a simpler visual encoder. Furthermore, LoCoNet achieves 68.1% (+22%) on the Columbia dataset [10], 97.2% (+2.8%) on the Talkies dataset [32], and 59.7% (+8%) on the Ego4D dataset [19].…”
Section: Introduction (mentioning)
Confidence: 93%
“…Types of encoder network. For the visual encoder, most ASD methods extract visual embeddings using 2D CNNs [20] since they require less GPU memory [3, 30, 32, 38], although some works [4, 30] use 3D CNNs [9]. For the audio encoder, existing audio backbones have high temporal downsampling, which makes it hard to extract per-frame features [21, 29].…”
Section: Active Speaker Detection (ASD) (mentioning)
Confidence: 99%
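
The temporal-downsampling point in the statement above can be made concrete with a minimal sketch, assuming PyTorch, an 80-bin mel-spectrogram input at 100 frames per second, and two hypothetical encoders invented here for illustration (nothing below is taken from the cited papers): stride-2 convolutions in a typical audio backbone shrink the time axis, so its output no longer aligns one-to-one with video frames, whereas keeping stride 1 preserves one embedding per frame.

import torch
import torch.nn as nn

class StridedAudioEncoder(nn.Module):
    """Hypothetical backbone: each stride-2 layer halves temporal resolution."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(80, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, mel):  # mel: (batch, 80 mel bins, T time steps)
        return self.net(mel)

class PerFrameAudioEncoder(nn.Module):
    """Same depth, but stride 1 everywhere keeps one embedding per time step."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(80, dim, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )

    def forward(self, mel):
        return self.net(mel)

mel = torch.randn(1, 80, 100)  # 1 s of audio at 100 mel frames per second
print(StridedAudioEncoder()(mel).shape)   # torch.Size([1, 64, 13]): ~8x fewer steps
print(PerFrameAudioEncoder()(mel).shape)  # torch.Size([1, 64, 100]): per-frame

The only difference between the two modules is the stride, which is exactly the alignment problem the quoted statement describes: a strided backbone yields far fewer audio embeddings than there are video frames, so per-frame audio features cannot be read off directly.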