2022
DOI: 10.1007/978-3-031-20059-5_3

Video Graph Transformer for Video Question Answering

Abstract: We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of multi-modal transformer fo…
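As a rough illustration of the design the abstract describes, separate video and text encoders scored against each other contrastively for QA, rather than a joint multi-modal answer classifier, here is a minimal PyTorch-style sketch. Every name in it (ContrastiveVideoQA, the placeholder linear encoders, the feature shapes) is a hypothetical stand-in under stated assumptions, not the authors' implementation.

```python
# Minimal sketch of contrastive video-text matching for multi-choice
# VideoQA, in the spirit of CoVGT's design (separate encoders, similarity
# scoring instead of a joint multi-modal answer classifier).
# All names and shapes below are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveVideoQA(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Stand-ins for the paper's dynamic graph transformer (video side)
        # and a text transformer (question fused with candidate answers).
        self.video_encoder = nn.Linear(dim, dim)  # placeholder encoder
        self.text_encoder = nn.Linear(dim, dim)   # placeholder encoder
        self.temperature = 0.07

    def forward(self, video_feats, qa_feats):
        # video_feats: (B, dim) pooled video representation
        # qa_feats:    (B, K, dim) question paired with each of K answers
        v = F.normalize(self.video_encoder(video_feats), dim=-1)
        t = F.normalize(self.text_encoder(qa_feats), dim=-1)
        # Cosine similarity between the video and every QA candidate;
        # the highest-scoring candidate is the predicted answer.
        logits = torch.einsum('bd,bkd->bk', v, t) / self.temperature
        return logits

model = ContrastiveVideoQA()
video = torch.randn(2, 768)   # 2 videos
qa = torch.randn(2, 5, 768)   # 5 answer candidates each
logits = model(video, qa)     # (2, 5) similarity scores
loss = F.cross_entropy(logits, torch.tensor([3, 0]))  # correct-answer indices
```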

Cited by 45 publications (23 citation statements: 0 supporting, 23 mentioning, 0 contrasting). References 71 publications.
“…Those methods that are only capable of descriptive content recognition cannot perform well, because they hardly capture the subtle transitions within the same scene over long horizons. To this end, recent work [53,54] proposes to encode video as a local-to-global dynamic graph of spatio-temporal objects, so that interaction relations can be encoded. However, VideoQA models [17,53,54] built upon a dynamic graph of patches may easily be distracted by an object's appearance and capture only limited motion information.…”
Section: Related Work (mentioning)
confidence: 99%
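The "local-to-global dynamic graph of spatio-temporal objects" that this statement describes can be pictured with a short sketch: object graphs are built within each clip from pairwise feature similarity (the local stage), then pooled into clip-level nodes that a temporal model can reason over (the global stage). The similarity-based adjacency, shapes, and function names below are illustrative assumptions, not code from [53,54] or CoVGT.

```python
# Illustrative sketch of a local-to-global dynamic graph over detected
# objects. The cosine-similarity adjacency is one simple choice of
# relation function; the cited papers may use learned relations.
import torch
import torch.nn.functional as F

def local_object_graph(obj_feats):
    # obj_feats: (N, dim) object features for one clip.
    x = F.normalize(obj_feats, dim=-1)
    adj = torch.softmax(x @ x.t(), dim=-1)  # (N, N) soft adjacency
    return adj @ obj_feats                  # one round of message passing

def local_to_global(clip_obj_feats):
    # clip_obj_feats: list of (N_i, dim) tensors, one per clip.
    # Local stage: relation-aware object features within each clip.
    locals_ = [local_object_graph(f).mean(dim=0) for f in clip_obj_feats]
    # Global stage: stack pooled clip nodes into a sequence that a
    # temporal transformer could then reason over.
    return torch.stack(locals_)             # (num_clips, dim)

clips = [torch.randn(n, 256) for n in (5, 7, 6)]  # toy detections per clip
global_nodes = local_to_global(clips)             # (3, 256)
```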
“…To this end, recent work [53,54] proposes to encode video as a local-to-global dynamic graph of spatio-temporal objects, so that interaction relations can be encoded. However, VideoQA models [17,53,54] built upon a dynamic graph of patches may easily be distracted by an object's appearance and capture only limited motion information. We alleviate this distraction with a novel two-stage training scheme that ensures a faithful representation of the motions critical for temporal reasoning.…”
Section: Related Work (mentioning)
confidence: 99%