2022
DOI: 10.1609/aaai.v36i1.19922
(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

Abstract: Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside …
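
To make the abstract's core idea concrete, here is a minimal sketch of how a 2D detection could be lifted into the pseudo-3D ("2.5D") space the paper builds on: back-project the box center using a monocular depth estimate. All names below, and the pinhole intrinsics fx, fy, cx, cy, are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def lift_to_pseudo_3d(box, depth_map, fx, fy, cx, cy):
    """Back-project the center of a 2D detection box into pseudo-3D
    camera coordinates using a monocular depth estimate.

    box:       (x1, y1, x2, y2) detection in pixel coordinates.
    depth_map: HxW array of per-pixel pseudo-depth.
    fx, fy, cx, cy: assumed pinhole camera intrinsics.
    """
    x1, y1, x2, y2 = box
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0   # box center in pixels
    z = float(depth_map[int(v), int(u)])      # pseudo-depth sampled at the center
    # Standard pinhole back-projection: pixel (u, v) at depth z -> (x, y, z).
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

Once every frame's object nodes live in this shared pseudo-3D space, the same physical object occupies nearby coordinates across frames, and time becomes the "+1" dimension along which temporal edges connect the nodes.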

Cited by 23 publications (5 citation statements)
References 42 publications
“…social connections), their combination in the video domain is still sparse. Two recent works [65], [66] have explored graph transformers for video-language tasks. [66] focuses on video dialogues and simply applies a global transformer over pooled graph representations built from static frames to represent a video.…”
Section: Transformer Over Visual Graph (mentioning)
confidence: 99%
“…[66] focuses on video dialogues and simply applies a global transformer over pooled graph representations built from static frames to represent a video. [65] proposes a tailor-made similarity kernel in the self-attention blocks to capture the proximity of nodes in a pseudo-3D space.…”
Section: Transformer Over Visual Graph (mentioning)
confidence: 99%
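
The "similarity kernel in the self-attention blocks" described above can be read as a distance-dependent bias on the attention logits. A hedged sketch, assuming an RBF-style kernel over pairwise pseudo-3D node distances (the paper's exact kernel may differ):

```python
import torch
import torch.nn.functional as F

def kernelized_self_attention(q, k, v, coords, sigma=1.0):
    """Self-attention with a proximity bias over pseudo-3D node positions.

    q, k, v: (N, d) query/key/value features for N graph nodes.
    coords:  (N, 3) pseudo-3D coordinates of the nodes.
    sigma:   kernel bandwidth (a free hyperparameter in this sketch).
    """
    d = q.size(-1)
    logits = q @ k.t() / d ** 0.5               # standard dot-product attention
    dist2 = torch.cdist(coords, coords) ** 2    # pairwise squared 3D distances
    logits = logits - dist2 / (2 * sigma ** 2)  # nearby nodes attend more strongly
    return F.softmax(logits, dim=-1) @ v
```

Subtracting the squared-distance term biases attention toward spatially nearby nodes, which a plain dot-product transformer over pooled frame features would not capture.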
“…The module is more parameter-efficient but has little impact on performance (CMTrans→CM in Table 2). Compared with other graph-based methods [9,23,60], VGT enjoys several advantages: 1) it explicitly models the temporal dynamics of both objects and their interactions; 2) it solves VideoQA by explicit similarity comparison between the video and text instead of classification.…”
Section: State-of-the-art Comparison (mentioning)
confidence: 99%
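
Advantage 2) contrasts classifying over a fixed answer vocabulary with scoring each candidate answer by video-text similarity. A minimal sketch of the latter, with illustrative names rather than VGT's actual API:

```python
import torch
import torch.nn.functional as F

def rank_answers(video_emb, answer_embs):
    """Pick the answer whose text embedding best matches the video.

    video_emb:   (d,) pooled video representation.
    answer_embs: (A, d) embeddings of the A candidate answers.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    answer_embs = F.normalize(answer_embs, dim=-1)
    scores = answer_embs @ video_emb    # cosine similarity per candidate
    return scores.argmax().item()       # index of the best-matching answer
```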
“…However, these scene graphs are usually 2D, while we attempt to explicitly incorporate the 3D scene geometry into the scene graphs. While Cherian et al. [9] propose (2.5+1)D scene graphs for video question answering, our task of audio source separation and motion prediction brings in several novel components beyond their setup.…”
Section: Related Work (mentioning)
confidence: 99%
“…Inspired by Cherian et al. [9], our ASMP framework begins by computing a dense 2.5D representation of the frames of a video, where 2.5D refers to the 2D visual context of the frames enriched with the pseudo-depth for each frame produced using a 2D-to-3D monocular depth prediction method [41]. Next, we succinctly capture the semantic context of this 2.5D visual scene by means of a novel 2.5D scene graph representation.…”
Section: Introduction (mentioning)
confidence: 99%
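
In such a pipeline, the per-object pseudo-depth can be obtained by pooling the dense monocular depth map inside each detection box. A hedged sketch, assuming median pooling (the cited works may aggregate differently):

```python
import numpy as np

def object_pseudo_depth(depth_map, box):
    """Summarize a dense pseudo-depth map into one depth value per object.

    depth_map: HxW array from a monocular depth predictor.
    box:       (x1, y1, x2, y2) detection box in pixel coordinates.
    """
    x1, y1, x2, y2 = (int(round(c)) for c in box)
    region = depth_map[y1:y2, x1:x2]
    # Median is robust to background pixels that fall inside the box.
    return float(np.median(region))
```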