2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01769

Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

Abstract: [not captured; only figure-label residue from a StyleCLIP comparison survives: "with bangs", "double chin", "black hair", "with wrinkles", "pale"] *The work was done during Zipeng Xu's internship at VIS, Baidu.

Cited by 29 publications (24 citation statements) | References 30 publications
“…Zhang et al [115] propose a disentangled sentiment representation adversarial network (DiSRAN) to reduce the domain shift of expressive styles for cross-domain sentiment analysis. Recent works [116], [117], [118], [119], [120] tend to focus on disentangling the rich information among multiple modalities and leveraging it to perform various downstream tasks. Alaniz et al [116] propose to use the semantic structure of text to disentangle the visual data, in order to learn a unified representation of text and images.…”
Section: Multimodal Application (mentioning, confidence: 99%)
“…Alaniz et al [116] propose to use the semantic structure of text to disentangle the visual data, in order to learn a unified representation of text and images. The PPE framework [117] realizes disentangled text-driven image manipulation by exploiting the power of the pretrained vision-language model CLIP [121]. Similarly, Yu et al [118] achieve counterfactual image manipulation by disentangling and leveraging the semantics in the text embedding of CLIP.…”
Section: Multimodal Application (mentioning, confidence: 99%)
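The quoted statements describe CLIP-driven manipulation at a high level. As a rough illustration of the underlying mechanism (a minimal sketch in the spirit of CLIP-space "global directions", not the PPE method itself; it assumes the openai/CLIP package, and the prompts are illustrative), a text-derived editing direction can be computed as follows:

```python
import torch
import clip  # openai/CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)

def text_direction(neutral: str, target: str) -> torch.Tensor:
    """Unit direction in CLIP embedding space from a neutral prompt
    to an attribute prompt (e.g., "a face" -> "a face with bangs")."""
    tokens = clip.tokenize([neutral, target]).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize each embedding
    direction = emb[1] - emb[0]
    return direction / direction.norm()

# Example: a direction for the "bangs" attribute. How this direction is
# mapped into a generator's latent space is model-specific and omitted here.
bangs_direction = text_direction("a face", "a face with bangs")
```

In StyleCLIP-style editors such a direction is translated into a latent-space edit; per the statements above, PPE's contribution is to disentangle that edit, predicting and preventing changes to correlated attributes before it is applied.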
“…To generate specific kinds of images that match a user's goal, the Conditional GAN (CGAN) [23] was proposed. A CGAN combines a basic GAN with external information, such as labels [24], [25], [26], text descriptions [14], [27], [28], segmentation maps [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], and images [8], [39]. For example, GANmut [24] introduces a novel GAN-based framework that learns an expressive and interpretable conditional space to generate a gamut of emotions, using only categorical emotion labels.…”
Section: Related Work (mentioning, confidence: 99%)
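To make the conditioning mechanism concrete, the following is a minimal CGAN-style generator sketch (PyTorch; all class names, dimensions, and the MNIST-like output size are illustrative, not taken from the cited works): the class label is embedded and concatenated with the noise vector before generation.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Minimal CGAN-style generator: conditions on a class label by
    embedding it and concatenating the embedding with the noise vector."""

    def __init__(self, z_dim=100, n_classes=10, label_dim=16, img_dim=28 * 28):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, label_dim)
        self.net = nn.Sequential(
            nn.Linear(z_dim + label_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, img_dim),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, labels):
        cond = self.label_emb(labels)                  # (B, label_dim)
        return self.net(torch.cat([z, cond], dim=1))   # (B, img_dim)

# Usage: generate four samples of class 3.
g = ConditionalGenerator()
z = torch.randn(4, 100)
labels = torch.full((4,), 3, dtype=torch.long)
images = g(z, labels).view(4, 1, 28, 28)
```

The discriminator is conditioned the same way (label embedding concatenated with the image features), so both networks see the conditioning signal.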
“…Further, they can be used to generate novel examples not found in the original dataset [21]. Such feature learning also supports a variety of other applications, such as super-resolution [22], multimodal applications [23,24,25,26,27], medical imaging [28,29], video prediction [30,31,32,33,34], natural language processing [35,36,37], and transfer learning and zero-shot learning [38].…”
Section: Introduction (mentioning, confidence: 99%)