Since the advent of CLIP [39], training large vision-language models (VLMs) has become a prominent paradigm for representation learning in computer vision. By observing huge corpora of paired images and captions crawled from the Web, these models learn a powerful and rich joint image-text embedding space, which has been employed in numerous visual tasks, including classification [60,61], segmentation [28,57], motion generation [49], image captioning [32,50], text-to-image generation [10,30,34,42,46], and image or video editing [3,5,7,17,24,37,54]. Recently, VLMs have also become a key component of text-to-image generative models [4,40,42,45], which rely on their textual representations to encapsulate the rich semantic meaning of the input text prompt.