Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-long.317
Check It Again: Progressive Visual Question Answering via Visual Entailment

Abstract: While sophisticated Visual Question Answering models have achieved remarkable success, they tend to answer questions only according to superficial correlations between question and answer. Several recent approaches have been developed to address this language priors problem. However, most of them predict the correct answer according to one best output without checking the authenticity of answers. Besides, they only explore the interaction between image and question, ignoring the semantics of candidate answers.…
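The re-ranking idea the abstract describes (scoring a pool of candidate answers instead of trusting the single best output) can be sketched minimally as follows. The function name, the linear interpolation, and the weight `alpha` are illustrative assumptions, not the paper's actual formulation:

```python
def rerank_answers(vqa_scores, entail_scores, alpha=0.5, top_n=3):
    """Re-rank candidate answers by blending the VQA model's confidence
    with a visual-entailment score for each candidate answer.

    vqa_scores / entail_scores: dicts mapping answer -> score in [0, 1].
    alpha: interpolation weight (illustrative, not from the paper).
    """
    # Keep only the top-N answers by VQA confidence as candidates.
    candidates = sorted(vqa_scores, key=vqa_scores.get, reverse=True)[:top_n]
    # Combine both signals and return the answer with the highest blend.
    combined = {a: alpha * vqa_scores[a] + (1 - alpha) * entail_scores.get(a, 0.0)
                for a in candidates}
    return max(combined, key=combined.get)
```

With this blend, an answer that merely looks likely from the question alone can be overridden when the entailment signal favors another candidate.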


Cited by 24 publications (7 citation statements) · References 24 publications
“…However, some studies [23] have found that this approach removes the good language prior required by the model. Alternatively, SAR [24] uses an additional model for answer filtering by adding semantic information about the answers to the model. However, we found that this method makes the model computationally intensive and uses a pre-trained model LXMERT [11] that uses a partial sample of the VQA-CP v2 test set.…”
Section: Methods Modifying Model Architecture
confidence: 99%
“…We implemented our approach on the baseline models of UpDn [10], CSS [27], LXMERT [34], and VLMO [12], respectively, and compared it with other mainstream approaches, namely SAR [24], Rescaling [32], CL [38], CSS + IntroD [22], DM [28], LMH-AttReg [26], and VPCL [25]. The VLMO model employs a shared embedding space to process images and text.…”
Section: Baseline Models
confidence: 99%
“…Third, adversarial methods [ 19 ], such as using adversarial losses, are used to reduce known sources of bias by inducing errors in the model when it is presented with only the question. Fourth, contrast learning-based methods [ 4 , 20 , 21 , 22 ] are used to enhance the utilization of information between the visual context and the question by constructing negative image-question pairs. Fifth, an additional annotation-based approach, refs.…”
Section: Related Work
confidence: 99%
“…Therefore, we do not directly compare experimentally with this line of works. Most recently, another line of work has been released where an additional objective of Visual Entailment is added to further boost performance (Si et al, 2021), which we do not compare for fairness.…”
Section: Related Work
confidence: 99%