“…In these methods, object detectors (Ren et al, 2015) and scene classifiers (He et al, 2016) are used to associate images with external knowledge. Further, external APIs, such as Google (Wu et al, 2022;Luo et al, 2021), Microsoft (Chen et al, 2021a;Yang et al, 2022) and OCR (Luo et al, 2021;Wu et al, 2022) are used to enrich the associated knowledge. Finally, pre-trained transformerbased language models (Chen et al, 2021a;Yang et al, 2022), or multimodal models (Wu et al, 2022;Luo et al, 2021;Wu et al, 2022;Garderes et al, 2020;Marino et al, 2021) are leveraged as implicit knowledge bases for answer predictions.…”