…Gan et al. (2017) and Zhao et al. (2020) have proposed style-guided captioning, but their methods also rely on training over paired data. CLIP (Radford et al., 2021) marked a turning point in vision-language perception and has been leveraged for vision-related tasks through various distillation techniques (Song et al., 2022; Jin et al., 2021; Gal et al., 2021; Khandelwal et al., 2022). Recent captioning methods use CLIP to reduce training time (Mokady et al., 2021), improve caption quality (Shen et al., 2021; Luo et al., 2022a,b; Cornia et al., 2021; Kuo and Kira, 2022), and enable zero-shot captioning (Su et al., 2022; Tewel et al., 2022).…