Visual-linguistic Pre-training. Following the prominent progress of transformer-based [63] pre-training in natural language processing [13,47,32,4,10,48], visual-linguistic pre-training models, either for image+text [39,59,8,37,20,73,36,15,35,40] or for video+text [56,35,41,75,33], have achieved great success on a number of downstream V+L tasks. Most existing VL models are designed in a two-step fashion: a pre-trained object detector first encodes the image as a set of regional features (serving as offline visual tokens), and the model is then pre-trained on a large-scale visual-linguistic corpus with objectives such as masked language modeling, image-text matching, and masked region modeling.
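To make the two-step design concrete, the following is a minimal sketch of the typical pipeline, not the implementation of any particular model cited above: class names and dimensions are illustrative, the regional features are assumed to be precomputed offline by a frozen detector (e.g., 2048-d Faster R-CNN features), and the masked-region objective is omitted for brevity.

```python
import torch
import torch.nn as nn

class TwoStepVLModel(nn.Module):
    """Illustrative sketch of a two-step VL pre-training model: text tokens
    and offline detector region features are fused by a transformer encoder,
    with heads for masked language modeling (MLM) and image-text matching (ITM)."""

    def __init__(self, vocab_size=30522, region_dim=2048, hidden=768,
                 num_layers=6, num_heads=12):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Project detector features into the transformer's hidden space.
        self.region_proj = nn.Linear(region_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)  # predict masked tokens
        self.itm_head = nn.Linear(hidden, 2)           # matched / mismatched pair

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) word-piece ids; region_feats: (B, R, region_dim),
        # precomputed by the offline object detector (step one of the pipeline).
        x = torch.cat([self.text_embed(token_ids),
                       self.region_proj(region_feats)], dim=1)
        h = self.encoder(x)                     # joint text-region contextualization
        T = token_ids.size(1)
        mlm_logits = self.mlm_head(h[:, :T])    # logits over text positions
        itm_logits = self.itm_head(h[:, 0])     # [CLS]-style matching score
        return mlm_logits, itm_logits

# Hypothetical usage with random inputs: 16 text tokens, 36 region features.
model = TwoStepVLModel()
mlm_logits, itm_logits = model(torch.randint(0, 30522, (2, 16)),
                               torch.randn(2, 36, 2048))
```

Because the detector is frozen and run offline in this design, the visual tokens cannot adapt during pre-training, a limitation that motivates end-to-end alternatives.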