Proceedings of the 27th ACM International Conference on Multimedia 2019
DOI: 10.1145/3343031.3351065

Multi-interaction Network with Object Relation for Video Question Answering

Cited by 59 publications (19 citation statements)
References 29 publications
“…Video-Text Matching Video-text matching has been widely studied in various tasks, such as video retrieval (Liu et al., 2019), moment localization with natural language (Zhang et al., 2019, 2020) and video question answering (Xu et al., 2017; Jin et al., 2019b). It aims to learn video-semantic representations in a joint embedding space.…”
Section: Related Work
confidence: 99%
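The joint embedding space mentioned in this statement can be illustrated with a short sketch: two projection heads map video and text features into a shared space, and a hinge-based ranking loss over in-batch negatives pulls matched pairs together. The module names, feature dimensions, and margin below are illustrative assumptions, not details taken from the cited papers.

```python
# Minimal sketch of a joint video-text embedding space (assumed dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)  # project video features
        self.text_proj = nn.Linear(text_dim, joint_dim)    # project text features

    def forward(self, video_feat, text_feat):
        # L2-normalize so the dot product equals cosine similarity
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, t

def triplet_ranking_loss(v, t, margin=0.2):
    # VSE-style ranking loss: matched pairs sit on the diagonal of the
    # similarity matrix; every off-diagonal entry is an in-batch negative.
    scores = v @ t.t()                                   # (B, B) similarities
    pos = scores.diag().unsqueeze(1)                     # positive pair scores
    cost_t = (margin + scores - pos).clamp(min=0)        # video -> wrong text
    cost_v = (margin + scores - pos.t()).clamp(min=0)    # text -> wrong video
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_t.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()
```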
“…To answer such semantically complicated questions, which require fine-grained comprehension of video content, relational reasoning-based methods [4]-[6], [14]-[16], [20] are proposed. More specifically, Jin et al. [4] propose a multi-modal and multi-level interaction network to capture relations between objects. Jiang et al. [14] develop a heterogeneous graph alignment network to integrate the relations of both inter- and intra-modality for cross-modal reasoning.…”
Section: A. Video Question Answering
confidence: 99%
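As a rough illustration of what a relation module over detected objects can look like, the sketch below lets every object feature attend to every other one with multi-head self-attention and a residual connection. This is a generic relational-reasoning formulation, not the specific multi-interaction architecture of MiNOR [4]; the dimensions are assumptions.

```python
# Hedged sketch of pairwise object-relation modeling via self-attention.
import torch.nn as nn

class ObjectRelation(nn.Module):
    def __init__(self, obj_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(obj_dim, hidden_dim)
        # Multi-head self-attention lets each object attend to all others
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, objects):
        # objects: (batch, num_objects, obj_dim) region features per frame
        x = self.proj(objects)
        relational, _ = self.attn(x, x, x)   # relation-aware object features
        return self.norm(x + relational)     # residual connection + layer norm
```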
“…1) Comparisons on MSRVTT-QA: For MSRVTT-QA, we compare the proposed LiVLR with recent methods, including Park et al. [21], DualVGR [15], HGA [14], MASN [6], HCR [16], MiNOR [4], HME [10], GRA [1], Co-mem [3], ST-VQA [2], VQA-T [13], CoMVT [68], ClipBERT [12], and SSML [67]. It is worth noting that VQA-T, CoMVT, ClipBERT, and SSML adopt large-scale video-language pretraining to enhance the downstream VideoQA task.…”
Section: B. Comparisons With State-of-the-arts
confidence: 99%
“…Some researchers have proposed capturing fine-grained appearance-question interactions. Jin et al. [13] introduced object-aware temporal attention that learns object-question interactions. Huang et al. [10] also utilize frame and object features to enhance co-attention between the appearance and the question.…”
Section: Related Work
confidence: 99%
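The object-question interaction described in this statement can be sketched as a simple additive attention layer: the pooled question embedding scores object features across frames and pools them into a question-aware video summary. All names and dimensions here are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of object-aware attention conditioned on the question.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectQuestionAttention(nn.Module):
    def __init__(self, obj_dim=512, q_dim=512, hidden_dim=256):
        super().__init__()
        self.obj_fc = nn.Linear(obj_dim, hidden_dim)
        self.q_fc = nn.Linear(q_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, objects, question):
        # objects:  (B, T*N, obj_dim) object features over frames
        # question: (B, q_dim) pooled question embedding
        fused = torch.tanh(self.obj_fc(objects) + self.q_fc(question).unsqueeze(1))
        weights = F.softmax(self.score(fused).squeeze(-1), dim=-1)   # (B, T*N)
        # Weighted sum of object features -> question-aware video summary
        return torch.bmm(weights.unsqueeze(1), objects).squeeze(1)   # (B, obj_dim)
```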