“…To improve the model's generalization and data efficiency, we perform data augmentation on both images and texts during the pre-training phase to construct more image-text pairs. We apply AutoAugment (Krizhevsky et al., 2012; Sato et al., 2015; Cubuk et al., 2019; Hoffer et al., 2020) for image augmentation, following the SOTA vision recognition methods (Touvron et al., 2021; Xie et al., 2020b). For text augmentation, we rewrite the original text using back-translation (Xie et al., 2020a; Sennrich et al., 2016a), so that the augmented texts remain semantically similar to the original.…”
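
The pairing described in the excerpt can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `PairAugmenter`, its fields, and the toy stand-in functions are all hypothetical, chosen only to show how an image transform and a back-translation round trip might be combined to produce a new image-text pair. In practice the image transform could be an AutoAugment policy (for example, the one provided by torchvision) and the two translation callables could wrap any machine-translation system; both are left as plug-in assumptions here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

# Hypothetical type alias: a function that maps a sentence to a sentence.
Translate = Callable[[str], str]


@dataclass
class PairAugmenter:
    """Illustrative helper (not from the source) that augments one image-text pair."""

    image_aug: Callable[[Any], Any]  # assumed: e.g. an AutoAugment transform
    to_pivot: Translate              # assumed: source language -> pivot language
    from_pivot: Translate            # assumed: pivot language -> source language

    def back_translate(self, text: str) -> str:
        # Back-translation: translate to a pivot language, then back again,
        # which tends to yield a paraphrase of the original sentence.
        return self.from_pivot(self.to_pivot(text))

    def __call__(self, image: Any, text: str) -> Tuple[Any, str]:
        # Augment both modalities independently to form a new pair.
        return self.image_aug(image), self.back_translate(text)


# Toy stand-ins so the sketch runs without external models:
# "flip" a list-shaped image and round-trip the text through upper/lower case.
def toy_flip(image):
    return image[::-1]


toy_augmenter = PairAugmenter(
    image_aug=toy_flip,
    to_pivot=str.upper,
    from_pivot=str.lower,
)
```

Keeping the image transform and the two translation directions as injected callables means the same wrapper works whether the augmentations are real models or simple stubs used for testing.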