…Trained on a vast dataset of 400 million (image, text) pairs, it learns a joint semantic space for images and texts using a contrastive loss. Benefiting from its strong image/text representation ability, CLIP has found extensive applications in different areas, such as domain adaptation [32], image segmentation [33], [34], and image generation and editing [12], [13], [35], [36], [37], [38], [39], [40]. In particular, LAFITE [41] and KNN-Diffusion [42] leverage the CLIP image-text feature space to enable language-free text-to-image generation.…
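To make the contrastive objective concrete, the sketch below shows a minimal PyTorch version of the symmetric InfoNCE loss used in CLIP-style training. The function name, the fixed temperature value, and the batch layout are illustrative assumptions; CLIP itself learns the temperature as a parameter, so this is a simplified sketch rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over paired embeddings.

    image_features, text_features: (N, d) tensors whose matching rows
    correspond to paired (image, text) samples. The temperature here is
    fixed for simplicity; CLIP learns it during training.
    """
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of scaled pairwise similarities.
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so the targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Minimizing this loss pulls matching image and text embeddings together while pushing apart the mismatched pairs within the batch, which is what yields the joint semantic space described above.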