2021
DOI: 10.48550/arxiv.2106.10446
Preprint

Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Abstract: Video Question Answering is a task that requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize the…
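The abstract describes computing separate motion- and appearance-grounded cross-modal features and selectively weighting them when answering. The snippet below is a minimal sketch of that idea as question-conditioned gating over two feature streams; the module name, dimensions, and gating scheme are illustrative assumptions, not the paper's actual MASN implementation.

    # Minimal sketch of question-conditioned fusion of motion and appearance
    # features, in the spirit of the motion-appearance synergy described in the
    # abstract. Module names, dimensions, and the gating scheme are illustrative
    # assumptions, not the authors' actual MASN implementation.
    import torch
    import torch.nn as nn

    class MotionAppearanceFusion(nn.Module):
        def __init__(self, dim: int = 512):
            super().__init__()
            # Scores how much each stream should contribute, given the question.
            self.gate = nn.Sequential(
                nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 2)
            )

        def forward(self, motion_feat, appearance_feat, question_feat):
            # All inputs: [batch, dim] pooled cross-modal features.
            joint = torch.cat([motion_feat, appearance_feat, question_feat], dim=-1)
            weights = torch.softmax(self.gate(joint), dim=-1)   # [batch, 2]
            fused = (weights[:, 0:1] * motion_feat
                     + weights[:, 1:2] * appearance_feat)       # [batch, dim]
            return fused, weights

    # Toy usage
    fusion = MotionAppearanceFusion(dim=512)
    m, a, q = (torch.randn(4, 512) for _ in range(3))
    out, w = fusion(m, a, q)
    print(out.shape, w.shape)  # torch.Size([4, 512]) torch.Size([4, 2])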

Cited by 5 publications (5 citation statements)
References 41 publications
“…We verify the proposed model on three well-known datasets, MSRVTT-QA (Xu et al., 2017a), MSRVTT multi-choice (Yu et al., 2018a), and TGIF-QA (Jang et al., 2017), widely used in recent video QA works (Jang et al., 2017; Gao et al., 2018; Li et al., 2019; Fan et al., 2019; Le et al., 2020; Zhu and Yang, 2020; Lei et al., 2021; Seo et al., 2021). Experiments show that our model achieves dramatic improvement over the powerful state-of-the-art model ClipBERT (Lei et al., 2021), with an average accuracy increment of more than 3 percentage points.…”
Section: Introduction
Mentioning confidence: 89%
“…Existing methods for video QA conduct direct answer selection based on the multimodal encoding of questions and videos (Jang et al., 2017; Lei et al., 2018, 2020). In recent years, researchers have proposed many optimization strategies for better performance in video question answering, e.g., designing delicate encoding mechanisms (Kim et al., 2020a; Nuamah, 2021; Gao et al., 2018; Li et al., 2019; Fan et al., 2019; Le et al., 2020; Jiang et al., 2020; Kim et al., 2020b; Seo et al., 2021) and graphs, adopting video pre-trained language models (Li et al., 2020; Zellers et al., 2021; Li and Wang, 2020; Lei et al., 2021; Sun et al., 2019), and leveraging external knowledge or resources (Chadha et al., 2020; Liu et al., 2020b; Song et al., 2021). Compared with conventional monomodal question answering tasks such as text QA (Oguz et al., 2021; Zhou et al., 2018) and table QA…”
Section: Introduction
Mentioning confidence: 99%
“…Since we are dealing with spatial and temporal dependencies, graphs can help establish these dependencies very well, and the work by Seo et al. (2021) presents the same. Object graphs are constructed via graph convolutional networks (GCNs) to compute the relationships among objects in each visual feature.…”
Section: Graph Based Techniques
Mentioning confidence: 94%
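As a concrete illustration of the object-graph idea quoted above, the following is a minimal sketch of one graph-convolution layer over per-clip object (region) features, using a similarity-based soft adjacency; the adjacency construction and dimensions are assumptions for illustration, not the exact formulation used in the cited work.

    # Minimal sketch of a graph convolution over object (region) features,
    # illustrating the object-graph idea above. The similarity-based adjacency
    # and the dimensions are illustrative assumptions, not the cited formulation.
    import torch
    import torch.nn as nn

    class ObjectGCNLayer(nn.Module):
        def __init__(self, dim: int = 512):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, obj_feats):
            # obj_feats: [batch, num_objects, dim] region features (e.g. from a detector).
            # Row-normalized pairwise similarity acts as a soft adjacency matrix.
            adj = torch.softmax(obj_feats @ obj_feats.transpose(1, 2), dim=-1)
            # Aggregate neighbour information, then transform (A X W with nonlinearity).
            return torch.relu(self.proj(adj @ obj_feats))

    # Toy usage: 4 clips, 10 objects each, 512-d features
    layer = ObjectGCNLayer(dim=512)
    x = torch.randn(4, 10, 512)
    print(layer(x).shape)  # torch.Size([4, 10, 512])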
“…Seeing is Knowing (106), MULAN (107); Faster R-CNN with ResNet-101: GAT (108), ATH (109), DMMGR (24), MCLN (110), MCAN (111), F-SWAP (112), SRRN (35), TVQA (113); Faster R-CNN with ResNet-152: RA-MAP (114), MASN (115), Anomaly based (114), Vocab based (116), DA-Net (117); ResNet CNN within Faster R-CNN: MuVAM (118); Faster R-CNN with ResNeXt-152: CBM (119); R-CNN (120): Multi-image (89); VGGNet (121): VQA-AID (122); EfficientNetV2 (123): RealFormer (124); YOLO (125): Scene Text VQA (126); CLIP ViT-B: CCVQA (14); ResNet NFNet (127): Flamingo (128); ViT (129): VLMmed (46), ConvS2S+ViT (130), BMT (10), M2I2 (52); XCLIP with ViT-L/14: CMQR (32); ResNet18, Swin, ViT: LV-GPT (43); GLIP (131): REVIVE (132); CLIP (133): KVQAE (30). 2.6.4 VGGNet (121): VGGNet (Visual Geometry Group Network) is a CNN with small convolutional filters, achieving good performance in image classification tasks. It is basically known for its simplicity and generalizability to new datasets.…”
Section: Faster R-CNN
Mentioning confidence: 99%
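Many of the systems listed in the quoted table, including MASN, rely on a Faster R-CNN detector to supply object regions whose features feed the downstream QA model. The snippet below is a hedged sketch of that front end using torchvision's off-the-shelf Faster R-CNN (a ResNet-50 FPN variant, not the exact ResNet-101/152 backbones listed in the table); the confidence threshold and frame size are arbitrary illustration values.

    # Hedged sketch: obtaining object regions with an off-the-shelf Faster R-CNN
    # from torchvision, as a stand-in for the detector front ends in the table
    # above (torchvision ships a ResNet-50 FPN variant, not ResNet-101/152).
    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

    # A dummy RGB frame in [0, 1]; a real pipeline would feed sampled video frames.
    frame = torch.rand(3, 480, 640)

    with torch.no_grad():
        detections = detector([frame])[0]  # dict with 'boxes', 'labels', 'scores'

    # Keep confident regions; a VQA model would pool visual features over these boxes.
    keep = detections["scores"] > 0.5
    print(detections["boxes"][keep].shape)  # [num_regions, 4]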