“…However, today's VQA models still suffer from severe language biases (Agrawal et al, 2018), over-relying on linguistic correlations rather than multi-modal reasoning. To realize robust VQA, recent works (Chen et al, 2020, 2023; Kolling et al, 2022; Agarwal et al, 2020; Gokhale et al, 2020a,b; Boukhers et al, 2022; Tang et al, 2020; Kant et al, 2021; Bitton et al, 2021; Askarian et al, 2022; Wang et al, 2021b) employ various data augmentation (DA) techniques, generating extra training samples to enhance VQA models' performance on both in-domain (ID) (Goyal et al, 2017) and out-of-distribution (OOD) datasets (Agrawal et al, 2018).…”