“…Numerous reasoning-based VQA methods [1,6,12,14,19,21,37,38,40,52,55,58,59,64] implicitly learn the relations between visual regions and the words in a question, e.g., through message passing [50], pairwise relationship modeling [4], adversarial learning [8,32,51], or graph parsing methods defined by inter/intra-class edges [15]. Other works leverage external information [18] or explicit scene graphs [5] to extract features from input images.…”