Recent years have seen rapid progress in vision-language pretraining (Uppal et al., 2020; Han et al., 2021; Khan et al., 2021). While a variety of approaches have been proposed, a large portion of them require object detection for image region feature regression or tagging as part of the pre-training objectives, for example LXMERT (Tan & Bansal, 2019), VLBERT (Su et al., 2020), VisualBERT (Li et al., 2019), UNITER (Chen et al., 2020b), Villa (Gan et al., 2020), Oscar, ERNIE-ViL (Yu et al., 2021), UNIMO, VinVL, VIVO, VL-T5 (Cho et al., 2021), etc. These methods rely on a strong object detection model like Fast(er) R-CNN (Ren et al., 2015), which is often trained on human-annotated datasets like Visual Genome (Krishna et al., 2016).
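To make the region-feature dependency concrete, the sketch below shows how such pipelines typically obtain per-region visual embeddings from a pretrained Faster R-CNN before feeding them to a multimodal transformer. This is a minimal illustration, not any of the cited models' actual extractors: it uses a COCO-pretrained torchvision detector rather than the Visual Genome-trained detectors those works rely on, and the `extract_region_features` helper, the score threshold, and the 36-region cap are assumptions for the example.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# COCO-pretrained detector as a stand-in for the Visual Genome-trained
# detectors used by region-based VLP models (an assumption for this sketch).
detector = fasterrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def extract_region_features(image, score_thresh=0.5, max_regions=36):
    """Return pooled region features and proposal boxes for one image tensor (C, H, W in [0, 1])."""
    # Standard Faster R-CNN stages: resize/normalize, backbone, region proposals.
    images, _ = detector.transform([image])
    feature_maps = detector.backbone(images.tensors)
    proposals, _ = detector.rpn(images, feature_maps)
    # RoI-pool each proposal and embed it with the box head; these pooled
    # vectors are the "region features" that serve as visual tokens.
    box_feats = detector.roi_heads.box_roi_pool(feature_maps, proposals, images.image_sizes)
    box_feats = detector.roi_heads.box_head(box_feats)            # (num_proposals, 1024)
    scores, _ = detector.roi_heads.box_predictor(box_feats)
    # Keep confident foreground proposals only (class 0 is background).
    keep = scores.softmax(-1)[:, 1:].max(-1).values > score_thresh
    return box_feats[keep][:max_regions], proposals[0][keep][:max_regions]

# Usage: feats, boxes = extract_region_features(torch.rand(3, 480, 640))
```

The returned feature matrix (at most 36 vectors of dimension 1024 here) is what a region-based model would concatenate with text token embeddings, which is why the quality and coverage of the detector, and of its human-annotated training data, directly bound these pretraining approaches.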