Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

Xu, Zipeng; Lin, Tianwei; Tang, Hao; Li, Fu; He, Dongliang; Sebe, Nicu; Timofte, Radu; Gool, Luc Van; Ding, Errui

doi:10.48550/arxiv.2111.13333

Cited by 1 publication

(2 citation statements)

References 34 publications

(57 reference statements)

“…CLIP-based approaches. Benefiting from the large-scale visual-language training, CLIP has shown impressive capability and generalizability on a wide range of tasks, such as text-driven image manipulation [9,24,38,59], image captioning [14,33], view synthesis [17], object detection [12,49,72], and semantic segmentation [42,73]. These applications mainly focus on building the semantic relationship between texts and visual entities, and hence they suffer less from linguistic ambiguity.…”

Section: Related Work (mentioning)

confidence: 99%

“…The problem of harnessing CLIP for perception assessment can be more challenging compared to existing works related to objective attributes, such as image manipulation [9,24,38,59], object detection [12,49,72], and semantic segmentation [42,73]. Specifically, CLIP is known to be sensitive to the choices of prompts [41], and perception is an abstract concept with no standardized adjectives, especially for the feel of images.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring CLIP for Assessing the Look and Feel of Images

Wang, Chan, Loy

2022

Preprint

Measuring the perception of visual content is a long-standing problem in computer vision. Many mathematical models have been developed to evaluate the look or quality of an image. Despite the effectiveness of such tools in quantifying degradations such as noise and blurriness levels, such quantification is loosely coupled with human language. When it comes to more abstract perception about the feel of visual content, existing methods can only rely on supervised models that are explicitly trained with labeled data collected via laborious user studies. In this paper, we go beyond the conventional paradigms by exploring the rich visual language prior encapsulated in Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner. In particular, we discuss effective prompt designs and show an effective prompt pairing strategy to harness the prior. We also provide extensive experiments on controlled datasets and Image Quality Assessment (IQA) benchmarks. Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments. Code will be available at https://github.com/IceClear/CLIP-IQA.
