ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053900
Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection

Cited by 101 publications (125 citation statements)
References 14 publications
“…Some studies [4,12,17] simply concatenate the extracted audio and visual features and apply a multi-layer perceptron (MLP)-based binary classifier to detect the active speaker in each short video segment, without modeling the inter-frame temporal dependency. Others adopt backend classifiers with temporal structure, such as recurrent neural networks (RNN) [44,45], gated recurrent units (GRU) [35], and long short-term memory (LSTM) [6,40,50], which have achieved preliminary success. Our proposed TalkNet is motivated by this line of work.…”
Section: Active Speaker Detection
confidence: 99%
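The contrast drawn in this citation context, frame-wise fusion with an MLP versus a temporal backend such as an LSTM, can be made concrete with a short sketch. The following is a minimal PyTorch illustration under assumed feature dimensions (`audio_dim`, `visual_dim`) and module names of my own choosing; it is not the implementation from any of the cited papers.

```python
import torch
import torch.nn as nn

class ConcatMLPDetector(nn.Module):
    """Frame-wise head: concatenate per-frame audio and visual features
    and classify each frame independently (no inter-frame context)."""
    def __init__(self, audio_dim=128, visual_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one logit per frame: speaking / not
        )

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (batch, T, audio_dim); visual_feat: (batch, T, visual_dim)
        x = torch.cat([audio_feat, visual_feat], dim=-1)
        return self.mlp(x).squeeze(-1)  # (batch, T) logits


class LSTMDetector(nn.Module):
    """Same concatenation fusion, but an LSTM backend models the
    inter-frame temporal dependency before the per-frame classifier."""
    def __init__(self, audio_dim=128, visual_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim + visual_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, audio_feat, visual_feat):
        x = torch.cat([audio_feat, visual_feat], dim=-1)
        h, _ = self.lstm(x)  # (batch, T, hidden): each step sees past frames
        return self.head(h).squeeze(-1)
```

Either head would be trained with a per-frame binary cross-entropy loss on speaking/non-speaking labels; the only difference is that the LSTM variant lets each frame's prediction depend on its temporal context.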
“…Moreover, the SCF layer still showed its superiority, improving substantially over concatenation and even surpassing chance-level performance despite having to cope with severely degraded audio data. As for the AVA video testing, both the concatenation and SCF results exceeded those reported in the dataset paper [25], even though testing was carried out with only 20 of the original videos. The SCF layer also improved further on this already strong baseline.…”
Section: Discussion
confidence: 72%
“…This proposed method handles audio and video information jointly and makes no assumptions about the input data. The audiovisual technique was extensively tested on two established ASD benchmarks, the Columbia [15] and AVA-ActiveSpeaker [25] datasets. The results were compared against those of audio-only and video-only approaches, as well as a naive concatenation multimodal baseline.…”
Section: Discussion
confidence: 99%