“…Visual Question Answering (VQA). The conventional visual question answering (VQA) task aims to answer questions about a given image. Multiple VQA datasets have been proposed, such as Visual Genome QA [25], VQA [2], GQA [16], CLEVR [22], and MovieQA [53]. Many works have achieved state-of-the-art performance on VQA tasks, including task-specific VQA models with various cross-modality fusion mechanisms [13,20,24,49,62,66,67] and joint vision-language models that are pretrained on large-scale vision-language corpora and finetuned on VQA tasks [6,11,29,30,33,52,68].…”