2020 25th International Conference on Pattern Recognition (ICPR), 2021
DOI: 10.1109/icpr48806.2021.9412884
Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Abstract: Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically cre…

Cited by 10 publications (2 citation statements)
References 21 publications
“…There seem to be few datasets for the VVAD task. The only competitive state-of-the-art dataset for VVAD that we found was WildVVAD [7]. WildVVAD is not only 3 times smaller than VVAD-LRS3, it is also more prone to false positives and false negatives because of the loose assumption that detected voice activity together with a single face in the video equals a speaking sample, and that every detected face in a video sequence without voice activity is a not-speaking sample.…”
Section: Related Work
confidence: 92%
“…WildVVAD is not only 3 times smaller than VVAD-LRS3, it is also more prone to false positives and false negatives because of the loose assumption that detected voice activity together with a single face in the video equals a speaking sample, and that every detected face in a video sequence without voice activity is a not-speaking sample.

(Dataset-comparison table from the citing paper; column labels inferred from the values, and the first row's dataset name and size are cut off in the excerpt:)

Dataset         Samples     Quality     Ratio
…               …           Very high   1-to-1
WildVVAD [7]    13,000      High        1-to-1
LRS3 [22]       >100,000    High        1-to-0
CUAVE [20]      ∼7,000      Low         1-to-0
…”
Section: Related Work
confidence: 99%