Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially Aware Language Acquisition

Stefanov, Kalin; Beskow, Jonas; Salvi, Giampiero

doi:10.1109/tcds.2019.2927941

Cited by 16 publications

(11 citation statements)

References 43 publications

“…These researchers were part of the participants at the ActivityNet Challenge 2019 - Task B Active Speaker Detection (AVA) using the AVA ActiveSpeaker dataset. The design in [9] made use of only visual cues to determine active speakers in videos weakly supervised by the audio stream for automatic labelling of the image frames. Once the face image frames were labeled through stochastic optimization, features were extracted using a CNN which were classified experimenting with a non-temporal (Perceptron) and a temporal (LSTM model).…”

Section: Related Work (mentioning)

confidence: 99%

“…ASD seeks to classify if a given face at a given time in a video is speaking or not [1] [2]. Active speaker determination proves useful in a number of tasks such as human-computer/human-robot interactions [3], where in a field of view of multiple speakers, a robot needs to know who is talking in order to turn its head or fix its gaze in that direction to visually pay attention and better manage conversations with the different speakers [4], audio-visual diarization (auto annotation of descriptions in video scenes) [5], allowing deaf audience to better appreciate movies [6], video conferencing systems to allow zooming in on the current speaker [7], a necessary step in the auto curation of audio samples from videos where the face image of the subjects are known [8], speaker naming where in addition to detecting the active speaker, the identity is also made known [6], speech enhancement, video re-targeting for meetings [1] and is a basic prerequisite for artificial cognitive systems in the acquisition of language in social settings [9]. Research in active speaker detection from videos is faced with challenges such as presence of multiple people leading to variability of possible speakers in a video, poor resolution [10], visibility of speaker in video (speakers who are off screen) [11], faces turned at inconvenient in-plane angles to the recording camera, recordings from YouTube are from varying demographics, have different illumination settings and faces are occluded in some cases.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, there are no hard-and-fast rules in this research area as researchers have tried other unique approaches and have used varying metrics for the evaluation of their algorithms, for example the mean Average Precision (mAP) [1,24], Lc performance that provides evaluation from an energy perspective [25], F-Score [26], ratio of correctly predicted samples to total number of test samples [27] and area under Receiver Operating Characteristic (auROC) curve [1,10]. Some researchers have used only facial cues [3,4,9,28,29], others have used just audio cues for example [5] and others have used a combination of both cues [1,2]. Some researchers in addition to facial cues have used head movements, hand gestures and prosody [4,28], yet others such as [3, 30-33] in order to determine active speakers, rely on the use of an array of multiple microphones and cameras because such setup provides directional and spatial information respectively, the problem with such methods apart from the extra overhead is that in most real-life scenarios such as YouTube videos they are not applicable.…”

Section: Introduction (mentioning)

confidence: 99%

An Active Speaker Detection Method in Videos using Standard Deviations of Color Histogram

Akinrinmade, Adetiba, Badejo. 2022. Preprint.

Active Speaker Detection (ASD) refers to the process of predicting who amongst a number of speakers whose faces appear on screen is speaking (if any) at any given time within the duration of a video. This paper proposes a novel method for determining active speakers in videos based on the standard deviations of Color Histograms (CHs) of the mouth region from frame-to-frame. The reasoning behind this is that the lips of an active speaker will open and close exposing and concealing the inner contents of the mouth such as the vocal cavity, teeth and tongue at fairly regular intervals in the process which are of different colors. Therefore, if the mouth region can be accurately localized and the changes in the color activities in that region analyzed during speaking such information can be used to detect if a person is actively speaking or not. The lips of a non-speaker are usually closed and at rest, so the CHs for such mouth region are expected to be fairly constant and as such the standard deviations should be low. If an experimentally determined threshold could be set, it can draw the line between active and non-active speakers. In this work, 53 videos available online from Channels TV news, one of Nigeria’s most popular TV stations were used to create 250 video clips totaling 3.6 hours, each ranging from between 15 seconds to 1 minute in such a way that the faces of two speakers were always simultaneously visible in any order in the duration of each video clip. The active speakers in each second of the video clips were manually labeled and used to evaluate the performance of the proposed methodology which achieved a prediction accuracy of up to 99.19%.

“…Such techniques are successful, for instance, as a means to recognize speech related facial motion while also discerning it from other unrelated movements. In [16,19], Stefanov et al used the outputs of a face detector, along with audio-derived voice activity detection (VAD) labels, to train a CNN task specific feature extractor together with a Perceptron classifier. For transfer learning comparison, known CNN architectures were used as pre-trained feature extractors whose outputs were employed in training temporal (LSTM net) and nontemporal (Perceptron) classifiers.…”

Section: ASD Recent Research (mentioning)

confidence: 99%

Bio-Inspired Modality Fusion for Active Speaker Detection

2021

Human beings have developed fantastic abilities to integrate information from various sensory sources exploring their inherent complementarity. Perceptual capabilities are therefore heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has successfully identified the superior colliculus region in the brain as the one responsible for this modality fusion, with a handful of biological models having been proposed to approach its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with achieved results confirming and greatly surpassing initial expectations.

“…Robot [136, 174-178]; Computer vision [135, 136, 178-181]; Natural language processing [182, 183]; Reinforcement…”

Section: Automatic Generation of Label Data (mentioning)

confidence: 99%

Object Detection Recognition and Robot Grasping Based on Machine Learning: A Survey

Bai, Yang, et al. 2020. IEEE Access.

With the rapid development of machine learning, its powerful function in the machine vision field is increasingly reflected. The combination of machine vision and robotics to achieve the same precise and fast grasping as that of humans requires high-precision target detection and recognition, location and reasonable grasp strategy generation, which is the ultimate goal of global researchers and one of the prerequisites for the large-scale application of robots. Traditional machine learning has a long history and good achievements in the field of image processing and robot control. The CNN (convolutional neural network) algorithm realizes training of large-scale image datasets, solves the disadvantages of traditional machine learning in large datasets, and greatly improves accuracy, thereby positioning CNNs as a global research hotspot. However, the increasing difficulty of labeled data acquisition limits their development. Therefore, unsupervised learning, self-supervised learning and reinforcement learning, which are less dependent on labeled data, have also undergone rapid development and achieved good performance in the fields of image processing and robot capture. According to the inherent defects of vision, this paper summarizes the research achievements of tactile feedback in the fields of target recognition and robot grasping and finds that the combination of vision and tactile feedback can improve the success rate and robustness of robot grasping. This paper provides a systematic summary and analysis of the research status of machine vision and tactile feedback in the field of robot grasping and establishes a reasonable reference for future research.
