“…CLIP-based approaches. Benefiting from the large-scale visual-language training, CLIP has shown impressive capability and generalizability on a wide range of tasks, such as text-driven image manipulation [9,24,38,59], image captioning [14,33], view synthesis [17], object detection [12,49,72], and semantic segmentation [42,73]. These applications mainly focus on building the semantic relationship between texts and visual entities, and hence they suffer less from linguistic ambiguity.…”