Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3548061
Dynamic Spatio-Temporal Modular Network for Video Question Answering

Cited by 7 publications (5 citation statements). References 32 publications.
“…We compare with 3 representative video QA models: HME (Fan et al 2019) is a memory-network-based model to encode video and text features; HCRN (Le et al 2020) uses conditional relational networks to build a hierarchical structure that learns video representation on both the clip level and the video level; PSAC (Li et al 2019) uses both video and question positional self-attention instead of RNNs to model dependencies of questions and temporal relationships of videos. To compare with models that explicitly model the multi-step reasoning process, we also compare with DSTN (Qian et al 2022), a neural module network concurrent to our work, and MAC (Hudson and Manning 2018), which performs iterative attention-based reasoning with a recurrent "Memory, Attention and Composition" cell. We make minor modifications to the attention of MAC to attend to 2-D (T × dim_V) temporal features instead of 3-D (H × W × dim_V) spatial features.…”
Section: Model Implementations
Confidence: 99%
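The MAC modification described above amounts to attending over one temporal axis rather than a flattened spatial grid. A minimal sketch, assuming illustrative names (`knowledge`, `control`) that are not from the paper's code:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(knowledge, control):
    """Attend over T frame features instead of H*W spatial positions.

    knowledge: (T, dim_V) per-frame visual features
    control:   (dim_V,)   query/control state driving the attention
    Returns (weights over T frames, attended feature of size dim_V).
    """
    scores = knowledge @ control    # (T,) one score per frame
    weights = softmax(scores)       # attention distribution over time
    attended = weights @ knowledge  # (dim_V,) weighted temporal summary
    return weights, attended

# With 3-D spatial features one would first flatten (H, W, dim_V) to
# (H*W, dim_V); the temporal variant skips that and attends over T directly.
T, dim_V = 8, 16
rng = np.random.default_rng(0)
weights, attended = temporal_attention(rng.normal(size=(T, dim_V)),
                                       rng.normal(size=dim_V))
```

The only structural change versus spatial MAC attention is the shape of the feature matrix; the score-softmax-sum pattern is unchanged.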
“…The temporal module is designed to transform the attention scores according to the switch keyword s. We use the same metric IoU_att to evaluate the output attention scores att_out. Inspired by (Qian et al 2022), we randomly sample two frames as the start and end frames for the baseline results. Specifically, the start frame is always the first frame when s = 'before', and the end frame is always the last frame when s = 'after'.…”
Section: Evaluation and Visualization of Modules' Intermediate Output
Confidence: 99%
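The random baseline described above can be sketched as follows; the function names and the temporal-IoU definition are assumptions for illustration, not the papers' actual code:

```python
import random

def sample_baseline_span(num_frames, s, rng=random):
    """Sample a (start, end) frame span as a random baseline.

    Pins start to frame 0 when s == 'before' and end to the last
    frame when s == 'after', per the constraint quoted above.
    """
    if s == 'before':
        start = 0
        end = rng.randrange(num_frames)
    elif s == 'after':
        end = num_frames - 1
        start = rng.randrange(num_frames)
    else:
        # Generic case: two distinct frames, ordered.
        start, end = sorted(rng.sample(range(num_frames), 2))
    return start, end

def temporal_iou(span_a, span_b):
    """Inclusive-frame temporal IoU between two (start, end) spans,
    one plausible reading of the IoU_att metric mentioned above."""
    inter = max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]) + 1)
    union = ((span_a[1] - span_a[0] + 1)
             + (span_b[1] - span_b[0] + 1) - inter)
    return inter / union
```

Under this scheme the 'before'/'after' constraints guarantee the sampled span is valid (start ≤ end) without rejection sampling.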