“…The current dominant framework for VQA systems consists of an image encoder, a question encoder, multimodal fusion, and an answer predictor. Instead of directly using visual features from CNN-based feature extractors, [56,11,41,33,49,38,63,36] explored various image attention mechanisms to locate regions that are relevant to the question. To learn a better representation of the question, [33,38,11] proposed to perform question-guided image attention and image-guided question attention jointly, merging knowledge from both the visual and textual modalities in the encoding stage.…”
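The excerpt describes the standard four-part VQA pipeline at a high level. The sketch below is a minimal, illustrative instance of that pipeline with question-guided image attention, not the architecture of any specific cited work; the module name, feature dimensions (e.g., 2048-d region features, a GRU question encoder), and the element-wise-product fusion are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedAttentionVQA(nn.Module):
    """Minimal VQA sketch: question encoder, question-guided attention
    over image region features, multimodal fusion, answer predictor."""

    def __init__(self, vocab_size=10000, num_answers=3000,
                 img_dim=2048, hidden_dim=512):
        super().__init__()
        # Question encoder: word embedding + GRU.
        self.embed = nn.Embedding(vocab_size, 300)
        self.gru = nn.GRU(300, hidden_dim, batch_first=True)
        # Question-guided image attention: score each region against the question.
        self.att_fc = nn.Linear(img_dim + hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Fusion and answer predictor.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, R, img_dim) region features from a pretrained CNN/detector.
        # question_tokens: (B, T) padded word indices.
        _, q = self.gru(self.embed(question_tokens))     # (1, B, hidden_dim)
        q = q.squeeze(0)                                  # (B, hidden_dim)

        # Attention weights over image regions, conditioned on the question.
        B, R, _ = img_feats.shape
        q_tiled = q.unsqueeze(1).expand(-1, R, -1)        # (B, R, hidden_dim)
        joint = torch.tanh(self.att_fc(torch.cat([img_feats, q_tiled], dim=-1)))
        alpha = F.softmax(self.att_score(joint), dim=1)   # (B, R, 1)
        attended = (alpha * img_feats).sum(dim=1)         # (B, img_dim)

        # Multimodal fusion (element-wise product) and answer prediction.
        fused = self.img_proj(attended) * q
        return self.classifier(fused)                     # (B, num_answers)


if __name__ == "__main__":
    model = QuestionGuidedAttentionVQA()
    img = torch.randn(2, 36, 2048)            # 36 region features per image
    qs = torch.randint(0, 10000, (2, 14))     # batch of tokenized questions
    print(model(img, qs).shape)               # torch.Size([2, 3000])
```

The co-attention approaches cited above ([33,38,11]) additionally attend over question words conditioned on the image; this sketch shows only the image-attention direction for compactness.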