2022
DOI: 10.48550/arxiv.2207.12647
Preprint
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

Abstract: Existing visual question answering methods tend to capture spurious correlations from the visual and linguistic modalities, and fail to discover the true causal mechanism that facilitates reasoning truthfully based on the dominant visual evidence and the correct question intention. Additionally, existing methods usually ignore the complex event-level understanding in multi-modal settings, which requires a strong cognitive capability of causal inference to jointly model cross-modal event temporality, causality…

Cited by 4 publications (7 citation statements)
References 86 publications
“…The experimental results on the test set of TrafficQA are shown in Table 1, where we also include the previous baseline models for EVQA (footnote 3: https://openai.com/blog/clip/). The results show that our proposed approach obtains an accuracy of 43.19 under the multiple-choice setting, surpassing previous state-of-the-art approaches, including Eclipse, ERM, TMBC, and CMCIR (Liu et al., 2022), by at least 4.5 points. Furthermore, our approach achieves an accuracy of 71.63 under Setting 1/2, outperforming previous strong baselines by at least 6 points.…”
Section: Results
confidence: 76%
“…) where F is the function measuring the similarity between an answer candidate and h_{T-1}, and y_{i,k} represents the answer label for the i-th example: if the correct answer for the i-th example is the k-th answer, then y_{i,k} is 1; otherwise it is 0.

Models                          Setting-1/4   Setting-1/2
Q-type (random)                 25.00         50.00
QE-LSTM                         25.21         50.45
QA-LSTM                         26.65         51.02
Avgpooling                      30.45         57.50
CNN+LSTM                        30.78         57.64
I3D+LSTM                        33.21         54.67
VIS+LSTM (Ren et al., 2015)     29.91         54.25
BERT-VQA (Yang et al., 2020)    33.68         63.50
TVQA (Lei et al., 2018)         35.16         63.15
HCRN (Le et al., 2020a)         36.49         63.79
Eclipse                         37.05         64.77
ERM                             37.11         65.14
TMBC                            37.17         65.14
CMCIR (Liu et al., 2022)        38.58         N/A
Ours                            43.19         71.63…”
Section: Training Objective
confidence: 99%
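The excerpt describes a standard multiple-choice objective: a similarity score between each answer candidate and the final hidden state h_{T-1}, trained against the one-hot labels y_{i,k}. Below is a minimal sketch of that objective, assuming the similarity function is a dot product and the loss is softmax cross-entropy; the cited paper only states that some function measures similarity, and all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def multiple_choice_loss(h_last, candidate_embs, labels):
    """Cross-entropy over answer candidates.

    h_last:         (batch, dim)     final hidden state h_{T-1}
    candidate_embs: (batch, K, dim)  embeddings of the K answer candidates
    labels:         (batch,)         index k of the correct answer (y_{i,k} = 1)

    Assumption: similarity is a dot product; the excerpt does not
    specify the form of the similarity function F.
    """
    # scores[i, k] = <candidate_embs[i, k], h_last[i]>
    scores = torch.einsum('bkd,bd->bk', candidate_embs, h_last)
    return F.cross_entropy(scores, labels)

# Toy usage: four answer options per question, as in Setting-1/4.
h = torch.randn(8, 256)
cands = torch.randn(8, 4, 256)
gold = torch.randint(0, 4, (8,))
loss = multiple_choice_loss(h, cands, gold)
```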
“…Traffic flow prediction is a classic spatial-temporal prediction problem [29,24]. Recently, the mainstream approach has been to combine GCN [10] and RNN [16] to model spatial-temporal correlations [27,30,9,6,31,23,46], as in DCRNN [20] and STGCN [48].…”
Section: Introduction
confidence: 99%
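The GCN + RNN pattern this citation refers to applies a graph convolution over the sensor network at each time step and a recurrent unit across time steps. The sketch below illustrates that generic pattern only; it is not DCRNN or STGCN themselves, and the class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class GCNGRU(nn.Module):
    """Illustrative GCN-then-GRU block for traffic forecasting."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, hid_dim)   # GCN weight: XW
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, 1)     # predict next-step flow per node

    def forward(self, x, a_hat):
        # x:     (batch, T, N, in_dim)  node features over T time steps
        # a_hat: (N, N)                 normalized adjacency D^-1/2 (A+I) D^-1/2
        b, t, n, _ = x.shape
        # Spatial mixing at every step: A_hat X W, then a nonlinearity.
        h = torch.relu(torch.einsum('nm,btmf->btnf', a_hat, self.w(x)))
        # Run one recurrent sequence per node to capture temporal correlation.
        h = h.permute(0, 2, 1, 3).reshape(b * n, t, -1)
        out, _ = self.gru(h)
        return self.head(out[:, -1]).view(b, n)  # flow estimate at step T+1

# Toy usage: 2 samples, 12 time steps, 5 sensors, 3 features per sensor.
x = torch.randn(2, 12, 5, 3)
a = torch.eye(5)  # placeholder normalized adjacency
pred = GCNGRU(3, 32)(x, a)  # -> (2, 5)
```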