AVA Active Speaker: An Audio-Visual Dataset for Active Speaker Detection

Roth, Joseph; Chaudhuri, Sourish; Klejch, Ondřej; Marvin, Radhika; Gallagher, Andrew; Kaver, Liat; Ramaswamy, Sharadh; Stopczynski, Arkadiusz; Schmid, Cordelia; Xi, Zhonghua; Pantofaru, Caroline

doi:10.1109/icassp40776.2020.9053900

Cited by 101 publications

(125 citation statements)

References 14 publications



“…Some studies [4,12,17] simply concatenate the extracted audio and visual features as the input, and apply a multi-layer perceptron (MLP)-based binary classifier to detect the active speaker at each short video segment, without considering the inter-frame temporal dependency. Others further adopt the backend classifier with temporal structure like recurrent neural network (RNN) [44,45], gated recurrent unit (GRU) [35] and long short-term memory (LSTM) [6,40,50], which have achieved preliminary success. Our proposed TalkNet is motivated by this thought.…”

Section: Active Speaker Detection (mentioning)

confidence: 99%
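The concatenation baseline described in the statement above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the implementation of any cited system: the function names, feature sizes, hidden width, and random initialisation are all assumptions made for the example. It shows only the structure the statement describes, namely audio and visual features joined into one vector and scored per segment by a small MLP, with no information shared between segments.

```python
import math
import random


def init_params(in_dim, hidden_dim, seed=0):
    """Random weights for a one-hidden-layer MLP (hypothetical sizes, untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def mlp_speaking_score(audio_feat, visual_feat, w1, b1, w2, b2):
    """Return a speaking probability for ONE short segment.

    The two modalities are fused by plain concatenation; the score depends
    only on this segment, so no inter-frame temporal context is used.
    """
    x = list(audio_feat) + list(visual_feat)  # early fusion by concatenation
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)  # ReLU hidden layer
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid for binary classification
```

Because each segment is scored in isolation, reordering the segments of a video leaves every score unchanged, which is the limitation that recurrent and attention-based back ends are meant to address.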

Is Someone Speaking?

Tao, Pan, Das, et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia


Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. The successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike the prior work where systems make decision instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decision by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. The experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and Columbia ASD dataset, respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD.
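As a rough illustration of the cross-attention and self-attention stages the abstract mentions, the sketch below applies scaled dot-product attention to two short frame-level sequences. It is not the TalkNet code (see the repository linked above for that): it assumes both sequences have the same length and feature size, omits the learned projections, multiple heads, and temporal encoders a real model would use, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each query returns a weighted mean of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def cross_attention(audio_seq, visual_seq):
    """Let each modality attend over the other (inter-modality interaction)."""
    audio_from_visual = attention(audio_seq, visual_seq, visual_seq)
    visual_from_audio = attention(visual_seq, audio_seq, audio_seq)
    return audio_from_visual, visual_from_audio


def self_attention(seq):
    """Let every frame attend over the whole sequence (long-term context)."""
    return attention(seq, seq, seq)
```

A possible use, under the same assumptions, is to concatenate the two cross-attended sequences frame by frame and pass the result through `self_attention`, so that each frame's representation can draw on evidence from the entire clip before a per-frame classifier is applied.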


“…Moreover, the SCF layer still showed its superiority by majorly improving over concatenation and even surpassing a random choice performance despite having to deal with crippling audio data. As for the AVA video testing, both the concatenation and SCF results beat those presented in the dataset paper [25], because even though testing was carried out with only 20 of the original videos, the performance gain was still exemplary in terms of result excellence. Even more so, the SCF layer successfully improved the already excellent baseline performance.…”

Section: Discussion (mentioning)

confidence: 72%

“…This proposed method deals with audio and video information jointly and makes no assumptions regarding input data. The audiovisual technique was extensively tested against data from two tried and tested ASD databases, specifically the Columbia [15] and the AVA-ActiveSpeaker [25] datasets. The obtained results were compared to those of audio-only and video-only approaches, as well as those of a naive concatenation multimodal baseline.…”

Section: Discussion (mentioning)

confidence: 99%

“…The largest and most heterogeneous active speaker detection dataset available at the moment is presumably AVA-ActiveSpeaker [25], developed by Google for the 4th ActivityNet challenge at CVPR 2019. This hand labeled set of segments obtained from 160 YouTube videos amounts to around 38.5 h of audiovisual data, where each of the covered 3.65 million frames is detailed with bounding box locations for all detected speaker faces as well as with three types of labels for each present bounding box-speaking and audible, speaking but not audible, not speaking.…”

Section: Datasets (mentioning)

confidence: 99%
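The annotation scheme described in the statement above (a face bounding box plus one of three speaking labels per frame) could be represented along these lines. This is only a sketch: the class and field names, the normalised box convention, and the choice to treat "speaking and audible" as the positive class are assumptions made for illustration, not details taken from the statement or from the dataset's file format.

```python
from dataclasses import dataclass
from enum import Enum


class SpeakingLabel(Enum):
    """The three per-box label types named in the citation statement."""
    SPEAKING_AUDIBLE = "speaking and audible"
    SPEAKING_NOT_AUDIBLE = "speaking but not audible"
    NOT_SPEAKING = "not speaking"


@dataclass(frozen=True)
class FaceAnnotation:
    """One labelled face box in one frame (hypothetical field names)."""
    video_id: str
    timestamp_s: float
    box: tuple  # (x1, y1, x2, y2); assumed normalised to [0, 1]
    label: SpeakingLabel


def is_active_speaker(annotation):
    """Collapse the three labels to a binary target.

    Assumption: only 'speaking and audible' counts as positive.
    """
    return annotation.label is SpeakingLabel.SPEAKING_AUDIBLE
```

For example, `is_active_speaker(FaceAnnotation("vid0", 1.25, (0.1, 0.2, 0.4, 0.6), SpeakingLabel.SPEAKING_NOT_AUDIBLE))` evaluates to `False` under this mapping; a study that counts inaudible speech as positive would need a different rule.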

“…The evaluation of the proposed ASD system required a considerably large amount of data, preferably with a high degree of heterogeneity and with respect to natural conversations. Yet as previously noted, no single dataset encompasses all these characteristics and so the testing phase examined data from two sources: The Columbia dataset [15]; and the 20 videos considered acceptable from the AVA-ActiveSpeaker dataset [25]. All of the latter's sequences were already under the 10 second mark and so were left untouched.…”

Section: Data Preparation (mentioning)

confidence: 99%


Bio-Inspired Modality Fusion for Active Speaker Detection

2021


Human beings have developed fantastic abilities to integrate information from various sensory sources exploring their inherent complementarity. Perceptual capabilities are therefore heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has successfully identified the superior colliculus region in the brain as the one responsible for this modality fusion, with a handful of biological models having been proposed to approach its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with achieved results confirming and greatly surpassing initial expectations.
