Location-Aware Graph Convolutional Networks for Video Question Answering

Huang, Deng; Chen, Peihao; Zeng, Runhao; Du, Qing; Tan, Mingkui; Gan, Chuang

doi:10.1609/aaai.v34i07.6737

Cited by 156 publications

(112 citation statements)

References 3 publications

Supporting

Mentioning

112

Contrasting


“…We compare our MSPAN with the state-of-the-art methods: PSAC (Li et al, 2019), HME (Fan et al, 2019), FAM (Cai et al, 2020), LGCN (Huang et al, 2020), HGA (Jiang and Han, 2020), QueST and HCRN (Le et al, 2020). 1, our method outperforms the state-of-the-art methods by 2.5% and 1.9% of accuracy on Action and Transition tasks.…”

Section: Results (mentioning)

confidence: 99%

Multi-Scale Progressive Attention Network for Video Question Answering

Guo, Zhao, Jiao et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer


Understanding the multi-scale visual information in a video is essential for Video Question Answering (VideoQA). Therefore, we propose a novel Multi-Scale Progressive Attention Network (MSPAN) to achieve relational reasoning between cross-scale video information. We construct clips of different lengths to represent different scales of the video. Then, the clip-level features are aggregated into node features by max-pooling, and a graph is generated for each scale of clips. For cross-scale feature interaction, we design a message passing strategy between adjacent scale graphs, i.e., top-down scale interaction and bottom-up scale interaction. Guided by the question through progressive attention, we fuse the video features of all scales. Experimental evaluations on three benchmarks (TGIF-QA, MSVD-QA and MSRVTT-QA) show that our method achieves state-of-the-art performance.
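The clip construction and max-pooling step described in this abstract can be illustrated with a small sketch. It is not the MSPAN implementation: the non-overlapping split, the function names (`max_pool`, `build_scale_nodes`, `build_multi_scale_nodes`) and the toy frame features are assumptions made for illustration, and the per-scale graphs and cross-scale message passing are left out.

```python
from typing import Dict, List

Vector = List[float]


def max_pool(frames: List[Vector]) -> Vector:
    """Element-wise max over a list of equal-length feature vectors."""
    return [max(values) for values in zip(*frames)]


def build_scale_nodes(frames: List[Vector], clip_len: int) -> List[Vector]:
    """Split frames into clips of clip_len frames and max-pool each clip.

    Assumption: clips are non-overlapping; the paper may build them differently.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [
        max_pool(frames[start:start + clip_len])
        for start in range(0, len(frames), clip_len)
    ]


def build_multi_scale_nodes(
    frames: List[Vector], clip_lens: List[int]
) -> Dict[int, List[Vector]]:
    """Return one list of node features per scale, keyed by clip length."""
    return {clip_len: build_scale_nodes(frames, clip_len) for clip_len in clip_lens}


# Toy video: 8 frames with 2-dimensional features (hypothetical values).
frames = [[float(t), float(-t)] for t in range(8)]
nodes = build_multi_scale_nodes(frames, [1, 2, 4])
```

Shorter clips give many fine-grained nodes and longer clips give a few coarse ones, which is what lets adjacent scales exchange information in the message passing stage the abstract mentions.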


“…There have been some attempts (Xu et al, 2017; Gao et al, 2018) to extract motion and appearance features and integrate them on a spatio-temporal dimension via memory networks. Li et al (2019) and Huang et al (2020) proposed better-performing models using attention in order to overcome the long-range dependency problem in memory networks. However, they do not represent motion information sufficiently since they only use features pre-trained on image or object classification.…”

Section: Related Work (mentioning)

confidence: 99%

“…There are three crucial challenges in video QA: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, and causality), and (3) cross-modal grounding between language and vision information. To tackle these challenges, previous studies (Li et al, 2019; Huang et al, 2020) have mainly explored this task by jointly embedding the features from the pre-trained word embedding model (Pennington et al, 2014) and the object detection models. However, as discussed in (Gao et al, 2018), visual features extracted from object detection models are poorly suited to motion analysis since the object detection model lacks temporal modeling.…”

Section: Introduction (mentioning)

confidence: 99%

“…As appearance and motion features share identical operations until the Motion-Appearance Fusion module, we combine superscript a and m for simplicity. Following L-GCN (Huang et al, 2020), we add a location encoding and define local features as:

$$v^{a/m}_{\text{local}} = \mathrm{FFN}([o^{a/m}; d_s; d_t]) \quad (1)$$

where $d_s = \mathrm{FFN}(b)$ and $d_t$ is obtained by position encoding according to each frame's index. Here $o^{a/m}$ denotes the object features mentioned above while FFN denotes a feed-forward network. Analogous to local features, position encoding information $d_t$ is added to global features as well. We then concatenate object features with global features to reflect the frame-level context in objects and obtain the visual representation $v^{a/m} \in \mathbb{R}^{K \times d}$:…”

mentioning

confidence: 99%
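The location encoding quoted above, local features = FFN([o; d_s; d_t]) with d_s = FFN(b), can be sketched in plain Python. This is an illustration under stated assumptions, not the code of L-GCN or MASN: `b` is treated as a 4-number bounding box, the temporal encoding is sinusoidal, each FFN is a single randomly initialised linear layer with ReLU, and all names and dimensions are hypothetical.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def make_ffn(in_dim: int, out_dim: int, seed: int) -> Callable[[Vector], Vector]:
    """Return a one-layer feed-forward network (linear + ReLU) with random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def ffn(x: Vector) -> Vector:
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} inputs, got {len(x)}")
        return [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b_j)
            for row, b_j in zip(weights, bias)
        ]

    return ffn


def temporal_encoding(frame_index: int, dim: int) -> Vector:
    """Sinusoidal encoding of a frame index (an assumed choice for d_t)."""
    encoding = []
    for i in range(dim):
        angle = frame_index / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


OBJ_DIM, LOC_DIM, OUT_DIM = 6, 4, 5  # hypothetical sizes
spatial_ffn = make_ffn(4, LOC_DIM, seed=0)                     # d_s = FFN(b)
fusion_ffn = make_ffn(OBJ_DIM + 2 * LOC_DIM, OUT_DIM, seed=1)  # FFN([o; d_s; d_t])


def local_feature(o: Vector, b: Vector, frame_index: int) -> Vector:
    """Compute a location-aware local feature for one detected object."""
    d_s = spatial_ffn(b)                            # spatial location encoding
    d_t = temporal_encoding(frame_index, LOC_DIM)   # temporal location encoding
    return fusion_ffn(o + d_s + d_t)                # list concatenation, then FFN


o = [0.5] * OBJ_DIM           # object appearance (or motion) feature
b = [0.1, 0.2, 0.6, 0.8]      # assumed box format: x1, y1, x2, y2, normalised
v_local = local_feature(o, b, frame_index=3)
```

Concatenating the two encodings before the final FFN means the same object feature yields different local features depending on where and when the object appears.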


Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Seo, Kang, Park et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer


Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at https://github.com/ahjeongseo/MASN-pytorch.
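The abstract does not say how the question-guided fusion is computed. One plausible reading, shown here purely as an assumption, is a softmax over dot-product scores between the question vector and each module's output, followed by a weighted sum. The function names and toy vectors are hypothetical and are not taken from the MASN code.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_fusion(
    question: Vector, motion: Vector, appearance: Vector
) -> Tuple[Vector, List[float]]:
    """Weight the motion and appearance outputs by their relevance to the question.

    Assumed scheme: dot-product scores, softmax weights, weighted sum.
    """
    weights = softmax([dot(question, motion), dot(question, appearance)])
    fused = [weights[0] * m + weights[1] * a for m, a in zip(motion, appearance)]
    return fused, weights


# Toy example: the question vector aligns with the motion output.
question = [1.0, 0.0]
motion = [2.0, 0.0]
appearance = [0.0, 2.0]
fused, weights = question_guided_fusion(question, motion, appearance)
```

In this toy case the motion output receives the larger weight, mirroring the idea that an action-oriented question should draw more on the motion module.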


“…Early works [41] rely on hand-crafted CNN architectures to further embed video and question. Inspired by [34], [13] utilizes attention-based methods to focus on relevant video clips. The attention mechanism also enhances the interpretability of these models, since it works in a simple but intuitive way.…”

Section: Related Work, 2.1 Video Question Answering (mentioning)

confidence: 99%

Relation-aware Hierarchical Attention Framework for Video Question Answering

Bai, Cao et al. 2021

Proceedings of the 2021 International Conference on Multimedia Retrieval


Video Question Answering (VideoQA) is a challenging video understanding task since it requires a deep understanding of both question and video. Previous studies mainly focus on extracting sophisticated visual and language embeddings and fusing them with delicate hand-crafted networks. However, the relevance of different frames, objects, and modalities to the question varies over time, which most existing methods ignore. Lacking an understanding of the dynamic relationships and interactions among objects poses a great challenge to the VideoQA task. To address this problem, we propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos. In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features. Then a graph-based relation encoder is utilized to extract the static relationship between visual objects. To capture the dynamic changes of multimodal objects in different video frames, we consider the temporal, spatial, and semantic relations, and fuse the multimodal features by a hierarchical attention mechanism to predict the answer. We conduct extensive experiments on a large-scale VideoQA dataset, and the experimental results demonstrate that our RHA outperforms the state-of-the-art methods.

CCS Concepts: Information systems → Question answering; Computing methodologies → Computer vision.
