2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
DOI: 10.1109/iccvw.2019.00460

Supplementary Material: AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Abstract: Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-Active…

Cited by 29 publications (53 citation statements)
References 36 publications
“…Action classification datasets include Kinetics, a video dataset for human action classification (Kay et al., 2017), ActivityNet, a video dataset for action classification and temporal localization (Fabian Caba Heilbron and Niebles, 2015), and AVA, a dataset of spatio-temporally localized atomic visual actions (AVA). Multi-modal AI datasets include AVA-ActiveSpeaker, an audio-visual dataset for speaker detection (Roth et al., 2019), the VGG lip reading dataset, an audio-visual dataset for speech recognition and separation, Mosi, a multimodal corpus of sentiment intensity (Zadeh et al., 2016, 2017), and OpenFace, a multi-modal face recognition toolkit (Baltrušaitis et al., 2016). The two major advantages of EgoCom are egocentricity and the inclusion of multiple participants' synchronized audio and video, which, as we show, simplifies multi-speaker applications.…”
Section: Related Work
confidence: 99%
“…An early work in multimodal ASD used TDNN [7], and there has been significant recent work using DNN-based ASD (e.g., [8], [9], [10], [11], [12], [13], [14]). There is now a large dataset created for this task [15] with an ASD competition [10]. However, we are not aware of any ASD (DNN or otherwise) that does low-latency accurate PTZ without large 2D SLL arrays.…”
Section: Related Work
confidence: 99%
“…• Television: To create a simulated environment similar to background television sounds, we first extract the audio from four publicly available video sources: the AVA-ActiveSpeaker dataset [26], an advertisement dataset [27], and two compilation videos for the TV shows "Friends" and "How I Met Your Mother" available on YouTube. For every training utterance, a random segment from one of these four sources is picked and added to the original signal at 13-20 dB SNR.…”
Section: Data Augmentation
confidence: 99%
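The augmentation described in the excerpt above (adding a randomly chosen background-TV segment to a training utterance at 13-20 dB SNR) can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited authors' implementation: the function name mix_at_snr, the NumPy-based representation of audio as float arrays, and the placeholder loaders are all hypothetical.

```python
# Minimal sketch of SNR-controlled noise mixing; names are illustrative,
# not taken from the cited work.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the speech-to-noise ratio equals `snr_db` dB."""
    # Pick a random noise segment as long as the speech (loop noise if too short).
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    start = np.random.randint(0, len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(segment ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * segment

# Usage (hypothetical loaders, not shown here):
# speech, noise = load_utterance(...), load_background_tv_audio(...)
# augmented = mix_at_snr(speech, noise, snr_db=np.random.uniform(13, 20))
```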