“…Popular image-language models such as CLIP [83] and ALIGN [48] are trained on massive datasets of web images and alt-text. Similarly, video-language models are catching up and can be categorised into two broad directions: (i) adapting image-language models for videos [8,22,49,50,62,65,71,108,110,119], and (ii) pure video-based models that are learned using large video-text datasets [3,7,26-28,30,57,61,64,67,68,95,117]. Recently, a new paradigm of post-pretraining has emerged in which an existing image- or video-language model goes through another stage of self-supervised pretraining on a small amount of video data before it is evaluated on downstream tasks [65,119].…”