(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

Cherian, Anoop; Hori, Chiori; Marks, Tim K.; Le Roux, Jonathan

doi:10.1609/aaai.v36i1.19922

Cited by 23 publications

(5 citation statements)

References 42 publications


“…social connections), their combination in the video domain is still sparse. Two recent works [65], [66] have explored graph transformers for video-language tasks. [66] focuses on video dialogues and simply applying a global transformer over pooled graph representations built from static frames to represent a video.…”

Section: Transformer Over Visual Graph (mentioning)

confidence: 99%


Video Graph Transformer for Video Question Answering

Xiao, Zhou, Chua, et al. 2022

Lecture Notes in Computer Science


We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions, respectively. With superior video encoding and QA solution, we show that CoVGT can achieve much better performance than prior art on video reasoning tasks. Its performance even surpasses that of models pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video contents. Our code will be available at https://github.com/doc-doc/CoVGT.



“…[66] focuses on video dialogues and simply applying a global transformer over pooled graph representations built from static frames to represent a video. [65] proposes a tailored-made similarity-kernel in the self-attention blocks to capture the proximity of nodes in a pseudo 3D space.…”

Section: Transformer Over Visual Graph (mentioning)

confidence: 99%

Video Graph Transformer for Video Question Answering

Xiao, Zhou, Chua, et al. 2022

Lecture Notes in Computer Science


“…The module is more parameter efficient but has little impact on the performances (CMTrans→CM in Table 2). Compared with other graph based methods [9,23,60], VGT enjoys several advantages: 1) It explicitly model the temporal dynamics of both objects and their interactions. 2) It solves VideoQA by explicit similarity comparison between the video and text instead of classification.…”

Section: State-of-the-art Comparison (mentioning)

confidence: 99%

Video Graph Transformer for Video Question Answering

Xiao, Zhou, Chua, et al. 2022

Preprint


This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations, and dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers for relevance comparison between the video and text to perform QA, instead of an entangled cross-modal Transformer for answer classification. Vision-text communication is done by additional cross-modal interaction modules. With more reasonable video encoding and QA solution, we show that VGT can achieve much better performance than prior art on VideoQA tasks that challenge dynamic relation reasoning in the pretraining-free scenario. Its performance even surpasses that of models pretrained with millions of external data. We further show that VGT can also benefit a lot from self-supervised cross-modal pretraining, yet with orders of magnitude less data. These results clearly demonstrate the effectiveness and superiority of VGT, and reveal its potential for more data-efficient pretraining. With comprehensive analyses and some heuristic observations, we hope that VGT can promote VQA research beyond coarse recognition/description towards fine-grained relation reasoning in realistic videos. Our code is available at https://github.com/sail-sg/VGT.
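To illustrate what "relevance comparison between the video and text" instead of answer classification could look like, here is a minimal sketch. It is not the VGT implementation: the function names `cosine` and `pick_answer` are hypothetical, the embeddings are plain Python lists standing in for the outputs of separate video and text encoders, and cosine similarity is assumed as the relevance score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def pick_answer(video_vec, candidate_vecs):
    """Return the index of the candidate most similar to the video, plus all scores.

    video_vec: embedding from a (hypothetical) video encoder.
    candidate_vecs: one embedding per question+answer candidate from a
    (hypothetical) text encoder. No classifier head is involved; the answer
    is chosen purely by comparing the two modalities.
    """
    scores = [cosine(video_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy example with made-up 2-D embeddings.
video = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
best_index, all_scores = pick_answer(video, candidates)
```

Because the candidates are scored rather than classified, the same scoring function works for any number of answer options, which is one plausible reason to prefer comparison over a fixed-size classification layer.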


“…However, these scene graphs are usually 2D, while we attempt to explicitly incorporate the 3D scene geometry into the scene graphs. While, Cherian et al [9] proposes (2.5+1)D scene graphs for video question answering, our task of audio source separation and motion prediction brings in several novel components beyond their setup.…”

Section: Related Work (mentioning)

confidence: 99%

“…Inspired by Cherian et al [9], our ASMP framework begins by computing a dense 2.5D representation of the frames of a video where 2.5D refers to the 2D visual context of the frames enriched with the pseudo-depth for that frame produced using a 2D-to-3D monocular depth prediction method [41]. Next, we succinctly capture the semantic context of this 2.5D visual scene by means of a novel 2.5D scene graph representation.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation

Chatterjee, Ahuja, Cherian 2022

Preprint


There exists an unequivocal distinction between the sound produced by a static source and that produced by a moving one, especially when the source moves towards or away from the microphone. In this paper, we propose to use this connection between audio and visual dynamics for solving two challenging tasks simultaneously, namely: (i) separating audio sources from a mixture using visual cues, and (ii) predicting the 3D visual motion of a sounding source using its separated audio. Towards this end, we present Audio Separator and Motion Predictor (ASMP), a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation. At the heart of ASMP is a 2.5D scene graph capturing various objects in the video and their pseudo-3D spatial proximities. This graph is constructed by registering together 2.5D monocular depth predictions from the 2D video frames and associating the 2.5D scene regions with the outputs of an object detector applied on those frames. The ASMP task is then mathematically modeled as the joint problem of: (i) recursively segmenting the 2.5D scene graph into several sub-graphs, each associated with a constituent sound in the input audio mixture (which is then separated) and (ii) predicting the 3D motions of the corresponding sound sources from the separated audio. To empirically evaluate ASMP, we present experiments on two challenging audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio Visual Event (AVE). Our results demonstrate that ASMP achieves a clear improvement in source separation quality, outperforming prior works on both datasets, while also estimating the direction of motion of the sound sources better than other methods.
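The idea of a 2.5D scene graph whose edges reflect pseudo-3D proximity can be sketched roughly as follows. This is an illustration under stated assumptions, not the ASMP construction: the `Node` class, the `build_edges` function, the `radius` threshold, and the use of Euclidean distance over (x, y, depth) are all hypothetical, and the coordinates are assumed to be already normalised to a common scale.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    """A detected object: a label plus its 2D centre and a pseudo-depth value."""
    label: str
    x: float      # horizontal centre of the detection (assumed normalised)
    y: float      # vertical centre of the detection (assumed normalised)
    depth: float  # pseudo-depth, e.g. from a monocular depth predictor

def build_edges(nodes, radius):
    """Connect every pair of nodes whose pseudo-3D distance is at most `radius`.

    Returns a list of (i, j, distance) tuples with i < j.
    """
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            a, b = nodes[i], nodes[j]
            d = math.dist((a.x, a.y, a.depth), (b.x, b.y, b.depth))
            if d <= radius:
                edges.append((i, j, d))
    return edges

# Toy scene: a person holding a guitar, and a car far in the background.
scene = [
    Node("person", 0.0, 0.0, 1.0),
    Node("guitar", 0.5, 0.0, 1.0),
    Node("car", 5.0, 5.0, 10.0),
]
edges = build_edges(scene, radius=1.0)
```

In this toy scene only the person and the guitar are linked, since the car is far away once depth is taken into account; two objects that overlap in the image plane but sit at very different depths would likewise stay unconnected.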

