2020
DOI: 10.1609/aaai.v34i07.6737

Location-Aware Graph Convolutional Networks for Video Question Answering

Abstract: We address the challenging task of video question answering, which requires machines to answer questions about videos in natural language. Previous state-of-the-art methods apply spatio-temporal attention mechanisms to video frame features without explicitly modeling the locations of, and relations among, the object interactions that occur in videos. However, these relations and their location information are critical for both action recognition and question reasoning. In th…
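The abstract's central idea, building a graph over detected objects whose node features carry their locations, can be sketched generically. The following is an illustrative one-layer graph convolution under assumed shapes and a fully connected relation graph; it is not the paper's actual LGCN architecture, and every name and parameter here is an assumption:

```python
import numpy as np

def location_aware_gcn_layer(obj_feats, boxes, W):
    """One illustrative graph-convolution step over detected objects.

    obj_feats: (N, D) appearance features for N detected objects.
    boxes:     (N, 4) normalized [x1, y1, x2, y2] box coordinates.
    W:         (D + 4, D_out) learnable projection matrix.
    """
    # Append location info to each node, since the abstract argues
    # object locations matter for interaction reasoning.
    nodes = np.concatenate([obj_feats, boxes], axis=1)   # (N, D+4)

    # Fully connected relation graph with uniform edge weights;
    # a real model would learn or sparsify these edges.
    N = nodes.shape[0]
    A = np.ones((N, N))
    A_norm = A / A.sum(axis=1, keepdims=True)            # row-normalize

    # Standard GCN propagation: aggregate neighbors, then project.
    return np.maximum(A_norm @ nodes @ W, 0.0)           # ReLU
```

In practice the per-frame object graphs would be stacked over time and fused with the question representation; this sketch only shows the spatial half of that pipeline.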

Cited by 156 publications (112 citation statements)
References 3 publications
“…We compare our MSPAN with the state-of-the-art methods: PSAC (Li et al., 2019), HME (Fan et al., 2019), FAM (Cai et al., 2020), LGCN (Huang et al., 2020), HGA (Jiang and Han, 2020), QueST, and HCRN (Le et al., 2020). As shown in Table 1, our method outperforms the state-of-the-art methods by 2.5% and 1.9% accuracy on the Action and Transition tasks.…”
Section: Results (mentioning)
confidence: 99%
“…There have been some attempts (Xu et al., 2017; Gao et al., 2018) to extract motion and appearance features and integrate them along the spatio-temporal dimension via memory networks. Li et al. (2019) and Huang et al. (2020) proposed better-performing models that use attention to overcome the long-range dependency problem of memory networks. However, they do not represent motion information sufficiently, since they only use features pre-trained on image or object classification.…”
Section: Related Work (mentioning)
confidence: 99%
“…There are three crucial challenges in video QA: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, and causality), and (3) cross-modal grounding between language and vision information. To tackle these challenges, previous studies (Li et al., 2019; Huang et al., 2020) have mainly explored this task by jointly embedding features from a pre-trained word embedding model (Pennington et al., 2014) and object detection models. However, as discussed in (Gao et al., 2018), visual features extracted from object detection models are ill-suited to motion analysis, since object detection models lack temporal modeling.…”
Section: Introduction (mentioning)
confidence: 99%
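The "cross-modal grounding" challenge the excerpt describes, aligning each question token with relevant visual elements, is commonly realized as dot-product attention from language features over visual features. A minimal generic sketch, assuming pre-extracted features of a shared dimension (not any cited paper's specific model):

```python
import numpy as np

def cross_modal_attention(q_feats, v_feats):
    """Ground question tokens in visual object features.

    q_feats: (T, D) embeddings for T question tokens.
    v_feats: (N, D) features for N visual regions/objects.
    Returns: (T, D) attended visual context, one vector per token.
    """
    scores = q_feats @ v_feats.T                  # (T, N) similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over regions
    return attn @ v_feats                         # weighted visual context
```

The attended context is typically fused with the token embeddings (e.g., by concatenation or gating) before answer prediction.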
“…Early works [41] rely on hand-crafted CNN architectures to further embed the video and question. Inspired by [34], [13] utilizes attention-based methods to focus on relevant video clips. The attention mechanism also enhances the interpretability of these models, since it works in a simple but intuitive way.…”
Section: Related Work (2.1 Video Question Answering; mentioning)
confidence: 99%