2021
DOI: 10.48550/arxiv.2104.08910
Preprint

Towards Open-World Text-Guided Face Image Generation and Manipulation

Weihao Xia,
Yujiu Yang,
Jing-Hao Xue
et al.

Abstract: The existing text-guided image synthesis methods can only produce limited quality results with at most 256² resolution, and the textual instructions are constrained to a small corpus. In this work, we propose a unified framework for both face image generation and manipulation that produces diverse and high-quality images with an unprecedented resolution of 1024² from multimodal inputs. More importantly, our method supports open-world scenarios, including both image and text, without any re-training, fine-tuni…

Cited by 4 publications (7 citation statements)
References 54 publications (125 reference statements)
“…It is said that "a picture is worth a thousand words", but recent research indicates that only a few words are often sufficient to describe one. Recent works that leverage the tremendous progress in vision-language models and data-driven image generation have demonstrated that text-based interfaces for image creation and manipulation are now finally within reach [12,24,29,30,39,40,42,48,53,59].…”
Section: Introduction (mentioning)
confidence: 99%
“…But because of the domain limitation of the existing generation models, the images they generate are limited to certain domains. Xia et al. [34] map input text to the StyleGAN latent space, while Xia et al. [35] use the cosine similarity of text and image embeddings encoded by CLIP as a loss function to optimize an embedding in the StyleGAN latent space. Due to the use of CLIP, Xia et al. [35] can process texts with more complex semantics, but its performance is random and visually unpleasant.…”
Section: Text-to-image Translation (mentioning)
confidence: 99%
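The second approach quoted above amounts to CLIP-guided latent optimization: a latent code of a pretrained StyleGAN is optimized so that the CLIP embedding of the generated image matches the CLIP embedding of the text. The sketch below is a minimal illustration of that idea, assuming a generator and a CLIP model in the style of OpenAI's clip package; the function name, the init_latent and clip_image_preprocess placeholders, and the hyperparameters are assumptions for illustration, not the cited authors' code.

```python
# Minimal sketch of CLIP-guided latent optimization (illustrative, not the cited code).
import torch
import torch.nn.functional as F

def clip_guided_optimization(generator, clip_model, clip_image_preprocess,
                             text_tokens, init_latent, num_steps=200, lr=0.05):
    # Encode the target text once; the text embedding stays fixed during optimization.
    with torch.no_grad():
        text_emb = F.normalize(clip_model.encode_text(text_tokens), dim=-1)

    # Optimize a latent code in the generator's latent space, starting from init_latent.
    latent = init_latent.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([latent], lr=lr)

    for _ in range(num_steps):
        image = generator(latent)  # synthesize an image from the current latent
        # clip_image_preprocess is assumed to be a differentiable resize/normalize
        # that maps the generator output to CLIP's expected input format.
        image_emb = F.normalize(clip_model.encode_image(clip_image_preprocess(image)), dim=-1)
        # Loss is one minus the cosine similarity between image and text embeddings.
        loss = 1.0 - (image_emb * text_emb).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return latent.detach()
```

A caller would decode the returned latent with the generator to obtain the final generated or manipulated image.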
“…They tokenize the images and use the image tokens together with the word tokens for auto-regressive training with a unidirectional transformer [1,18]. Other methods [17,27,28] employ a pretrained StyleGAN [8] and manipulate its latent code according to face descriptions.…”
Section: Related Work (mentioning)
confidence: 99%
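For the auto-regressive formulation quoted above (word tokens and discretized image tokens modeled jointly by a unidirectional transformer), a rough sketch under stated assumptions is given below; the vocabulary sizes, the shared-embedding layout, and the class name are illustrative and not taken from the cited implementations [1,18].

```python
# Rough sketch of an auto-regressive text-to-image model over discrete tokens:
# word tokens and (e.g. VQ-VAE) image tokens form one sequence, and a causal
# transformer predicts each next token. All names and sizes are illustrative.
import torch
import torch.nn as nn

class TextToImageTransformer(nn.Module):
    def __init__(self, text_vocab=10000, image_vocab=8192, dim=512, depth=6, heads=8):
        super().__init__()
        # Shared embedding table; image token ids are assumed to be offset by text_vocab.
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, text_vocab + image_vocab)

    def forward(self, word_tokens, image_tokens):
        # Concatenate word tokens followed by image tokens into one sequence.
        seq = torch.cat([word_tokens, image_tokens], dim=1)
        x = self.embed(seq)
        # A causal mask enforces the unidirectional (left-to-right) factorization.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        h = self.transformer(x, mask=mask)
        # Logits for every position; training would use cross-entropy against the
        # sequence shifted by one position (next-token prediction).
        return self.to_logits(h)
```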