Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475587

Is Someone Speaking?

Abstract: Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as on audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and vis…



Cited by 91 publications (65 citation statements)
References 38 publications
“…To address the limitations of SCMIA, we propose to incorporate the AV-activity information, which can complement the speaker's identity information. We use AV-activity-based models for ASD, such as TalkNet [25] and SyncNet [11], to gather the concerning AV-activity information.…”
Section: Volume
mentioning confidence: 99%
“…For GSCMIA, we employ two strategies to gather AV-activity ASD predictions: i) TalkNet [25] and ii) SyncNet [11]. TalkNet [25] is one of the state-of-the-art methods; it observes a face track and the concurrent audio waveform and models the short-term and long-term temporal context to provide face-box-wise active speaker predictions. It has been trained in a fully supervised manner using the active speaker annotations in each frame.…”
Section: Experiments and Implementation Details — A. Implementation Details
mentioning confidence: 99%