2020
DOI: 10.1109/tcds.2019.2927941

Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially Aware Language Acquisition

Abstract: This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based …

Cited by 16 publications (11 citation statements)
References 43 publications
“…These researchers were part of the participants at the ActivityNet Challenge 2019 - Task B Active Speaker Detection (AVA) using the AVA ActiveSpeaker dataset. The design in [9] made use of only visual cues to determine active speakers in videos, weakly supervised by the audio stream for automatic labelling of the image frames. Once the face image frames were labelled through stochastic optimization, features were extracted using a CNN and classified, experimenting with a non-temporal (Perceptron) and a temporal (LSTM) model.…”
Section: Related Work (mentioning; confidence: 99%)
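To make the pipeline described in that statement more concrete, below is a minimal sketch of the weak-supervision idea: audio-derived voice activity labels stand in for manual annotation and supervise a visual classifier over detected face crops. This is an illustrative reading of the statement, not the configuration reported in [9]; the network sizes, optimizer settings, and tensor shapes are assumptions.

```python
# Minimal sketch: audio VAD labels weakly supervise a visual classifier
# over face crops (PyTorch assumed; all dimensions are illustrative).
import torch
import torch.nn as nn

class FaceFeatureCNN(nn.Module):
    """Small CNN mapping a face crop to a feature vector."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):                  # x: (B, 3, H, W) face crops
        return self.fc(self.conv(x).flatten(1))

class PerceptronHead(nn.Module):
    """Non-temporal classifier: one face crop -> speaking / not speaking."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.out = nn.Linear(feat_dim, 2)

    def forward(self, feats):              # feats: (B, feat_dim)
        return self.out(feats)

cnn, head = FaceFeatureCNN(), PerceptronHead()
optim = torch.optim.Adam(
    list(cnn.parameters()) + list(head.parameters()), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Weak supervision: the labels come from an audio VAD stream, so no
# manual visual annotation is needed. Random tensors stand in for the
# face-detector output and the VAD labels here.
face_crops = torch.randn(8, 3, 64, 64)
vad_labels = torch.randint(0, 2, (8,))

optim.zero_grad()
loss = loss_fn(head(cnn(face_crops)), vad_labels)
loss.backward()
optim.step()
```

The same features could instead feed a temporal model, which is the non-temporal/temporal comparison the statement mentions.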
“…ASD seeks to classify whether a given face at a given time in a video is speaking or not [1] [2]. Active speaker determination proves useful in a number of tasks, such as human-computer/human-robot interaction [3], where, with multiple speakers in the field of view, a robot needs to know who is talking in order to turn its head or fix its gaze in that direction, visually pay attention, and better manage conversations with the different speakers [4]; audio-visual diarization (automatic annotation of descriptions in video scenes) [5]; allowing deaf audiences to better appreciate movies [6]; video conferencing systems that zoom in on the current speaker [7]; a necessary step in the automatic curation of audio samples from videos where the face images of the subjects are known [8]; speaker naming, where in addition to detecting the active speaker the identity is also made known [6]; speech enhancement and video re-targeting for meetings [1]; and as a basic prerequisite for artificial cognitive systems acquiring language in social settings [9]. Research in active speaker detection from videos faces challenges such as the presence of multiple people (and hence variability in the number of possible speakers in a video), poor resolution [10], speakers who are off screen [11], faces turned at inconvenient in-plane angles to the recording camera, and YouTube recordings that span varying demographics, differ in illumination, and sometimes contain occluded faces.…”
Section: Introduction (mentioning; confidence: 99%)
“…Such techniques are successful, for instance, as a means to recognize speech-related facial motion while also discerning it from other, unrelated movements. In [16, 19], Stefanov et al. used the outputs of a face detector, along with audio-derived voice activity detection (VAD) labels, to train a task-specific CNN feature extractor together with a Perceptron classifier. For a transfer-learning comparison, known CNN architectures were used as pre-trained feature extractors whose outputs were employed in training temporal (LSTM) and non-temporal (Perceptron) classifiers.…”
Section: ASD Recent Research (mentioning; confidence: 99%)
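The transfer-learning variant in that comparison could look roughly as follows: a frozen pre-trained CNN produces per-frame face features, and an LSTM classifies each frame of the track as speaking or not. The backbone choice (ResNet-18, assuming torchvision >= 0.13) and all dimensions are assumptions for illustration, not necessarily those used by Stefanov et al.

```python
# Hedged sketch of the transfer-learning setup: frozen pre-trained CNN
# features feed a temporal (LSTM) per-frame classifier.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()          # expose the 512-d pooled features
backbone.eval()                      # frozen, used only as an extractor

class TemporalASD(nn.Module):
    """LSTM over per-frame face features -> per-frame speaking logits."""
    def __init__(self, feat_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)

    def forward(self, feats):        # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.out(h)           # (B, T, 2) per-frame logits

# Two face tracks of 16 frames each (random stand-ins for real crops).
clip = torch.randn(2, 16, 3, 224, 224)
with torch.no_grad():
    feats = backbone(clip.flatten(0, 1)).view(2, 16, -1)
logits = TemporalASD()(feats)        # per-frame speaking scores
```

Swapping `TemporalASD` for a single linear layer applied per frame recovers the non-temporal (Perceptron) baseline used in the comparison.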
“…Robot [136, 174–178]; computer vision [135, 136, 178–181]; natural language processing [182, 183]; reinforcement …”
Section: Automatic Generation of Label Data (mentioning; confidence: 99%)