2021
DOI: 10.1109/tip.2021.3051756

Graph-Based Multi-Interaction Network for Video Question Answering

Cited by 34 publications (6 citation statements)
References 65 publications
“…For What, Who, and When types, the CMCIR outperforms all the comparison methods significantly. Although GMIN [100] and CASSG [104] perform marginally better than our CMCIR for How and Where types, our CMCIR performs significantly better than GMIN for What (+8.3%), Who (+9.0%), When (+1.6%), and the overall (+8.3%) tasks.…”
Section: Results On Other Benchmark Datasets (mentioning)
confidence: 68%
“…• GMIN [100]: A graph-based relation-aware neural network that explores the relationships and dependencies between objects spatially and temporally.…”
Section: Results On Other Benchmark Datasets (mentioning)
confidence: 99%
“…However, the monolithic graph is cumbersome to extend to long videos with multiple objects. More recently, GMIN (Gu et al 2021) builds a spatio-temporal graph over object trajectories and shows improvements over its attention version (Jin et al 2019). While L-GCN and GMIN construct query-blind graphs, HGA (Jiang and Han 2020), DualVGR (Wang, Bao, and Xu 2021) and B2A (Park, Lee, and Sohn 2021) design query-specific graphs for better performance.…”
Section: Related Work (mentioning)
confidence: 99%
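The distinction drawn in the quote above between query-blind graphs (L-GCN, GMIN) and query-specific graphs (HGA, DualVGR, B2A) can be made concrete with a short sketch. The snippet below is illustrative only; the shapes, function names, and the dot-product affinity used to build the adjacency are assumptions for exposition, not the actual construction used in any of the cited models.

```python
# Minimal PyTorch sketch: query-blind vs. query-specific graph construction
# over object-trajectory node features. All names and shapes are hypothetical.
import torch
import torch.nn.functional as F

def query_blind_graph(obj_feats):
    """Adjacency from pairwise object-feature similarity only.

    obj_feats: (N, d) features of N object-trajectory nodes.
    """
    sim = obj_feats @ obj_feats.t()      # (N, N) dot-product affinity
    return F.softmax(sim, dim=-1)        # row-normalized adjacency

def query_specific_graph(obj_feats, q_feat):
    """Adjacency modulated by a pooled question embedding q_feat of shape (d,)."""
    gated = obj_feats * q_feat           # question-conditioned node features
    sim = gated @ gated.t()
    return F.softmax(sim, dim=-1)

def propagate(adj, obj_feats):
    """One round of message passing over either graph."""
    return torch.relu(adj @ obj_feats)   # aggregate neighbor features
```

In the query-blind case the graph depends only on the video objects, so the same structure is reused for every question; the query-specific variant re-weights edges per question, which is what the cited works exploit for better performance.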
“…In Table 1 and Table 2, we compare our model with some established VideoQA techniques covering 4 major categories: 1) cross-attention (e.g., ST-VQA (Jang et al 2017), PSAC (Li et al 2019b), STA (Gao et al 2019), MIN (Jin et al 2019) and QueST), 2) motion-appearance memory (e.g., AMU (Xu et al 2017), Co-Mem (Gao et al 2018) and HME (Fan et al 2019)), 3) graph-structured models (e.g., L-GCN (Huang et al 2020), HGA (Jiang and Han 2020), DualVGR (Wang, Bao, and Xu 2021), GMIN (Gu et al 2021) and B2A (Park, Lee, and Sohn 2021)) and 4) hierarchical models (e.g., HCRN (Le et al 2020) and HOSTR (Dang et al 2021)). The results show that our Hierarchical QGA (HQGA) model performs consistently better than the others on all the experimented datasets.…”
Section: The State Of the Art Comparison (mentioning)
confidence: 99%
“…New generalizations and definitions have been developed to handle the complexity of structured data: the graph convolution generalizes the 2D convolution, and GNNs have evolved to overcome the limitations of CNNs on such data [12]. GCN generalizes convolution from images to graphs [13] and has been adopted efficiently in many applications [14], [15], [16]. The natural structure of the human skeleton is a graph in geometric space, with the joints as nodes and the natural bone connections of the human body as edges.…”
Section: Introduction (mentioning)
confidence: 99%
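The quoted passage describes GCN as a generalization of convolution from image grids to graphs, with the human skeleton as a natural example (joints as nodes, bones as edges). The sketch below shows one GCN layer in the standard normalized-adjacency form H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), applied to a toy skeleton graph; the joint list, feature sizes, and random weights are purely illustrative assumptions, not taken from the cited work.

```python
# Minimal NumPy sketch of one GCN layer on a toy 5-joint skeleton graph.
import numpy as np

# Toy skeleton bones: head-neck, neck-l_shoulder, neck-r_shoulder, neck-hip
bones = [(0, 1), (1, 2), (1, 3), (1, 4)]
N, d_in, d_out = 5, 8, 16

A = np.zeros((N, N))
for i, j in bones:
    A[i, j] = A[j, i] = 1.0                  # undirected bone edges
A_hat = A + np.eye(N)                        # add self-loops
D_inv_sqrt = np.diag(A_hat.sum(1) ** -0.5)   # symmetric degree normalization
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

H = np.random.randn(N, d_in)                 # input joint features
W = np.random.randn(d_in, d_out)             # layer weights (random stand-in)
H_next = np.maximum(A_norm @ H @ W, 0)       # one GCN layer with ReLU
print(H_next.shape)                          # (5, 16)
```

Each layer mixes every joint's features with those of its bone-connected neighbors, which is what "generalized convolution from image to graph" amounts to in practice: the fixed pixel neighborhood of a 2D convolution is replaced by the graph's adjacency structure.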