2021
DOI: 10.48550/arxiv.2111.13792
Preprint

LAFITE: Towards Language-Free Training for Text-to-Image Generation

Abstract: One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time- and cost-consuming. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multimodal semantic space of the powerful pre-trained CLIP model…
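For context, the "well-aligned multimodal semantic space" the abstract refers to is CLIP's joint image-text embedding space. A minimal sketch of that alignment using the Hugging Face transformers CLIP wrappers (the model name and image file are illustrative, not taken from the paper):

```python
# Minimal sketch: CLIP maps images and text into one shared embedding space,
# so an image feature can stand in for a caption feature (the idea LAFITE builds on).
# Assumes the Hugging Face `transformers` CLIP wrappers; names below are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                 # any training image, no caption required
texts = ["a photo of a dog", "a photo of a car"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Project both modalities and compare them in the shared space.
img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print((img_emb @ txt_emb.T).squeeze(0))           # cosine similarity to each caption
```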

Cited by 26 publications (43 citation statements)
References 45 publications

“…The human evaluator may also indicate that neither image is significantly better than the other, in which case half of a win is assigned to both models.

Model                            FID     Zero-shot FID
DM-GAN (Zhu et al., 2019)        32.64
DF-GAN (Tao et al., 2020)        21.42
DM-GAN + CL (Ye et al., 2021)    20.79
XMC-GAN (Zhang et al., 2021)      9.33
LAFITE (Zhou et al., 2021)        8.12
DALL-E (Ramesh et al., 2021)             ∼ 28
LAFITE (Zhou et al., 2021)               26.94
GLIDE                                    12.24
GLIDE (Validation filtered)              …”
Section: Quantitative Results (mentioning)
confidence: 99%
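The tie-handling rule quoted above is straightforward to make concrete; a small illustrative tally (the function and vote labels are my own, not from the cited paper):

```python
# Illustrative tally for the pairwise human evaluation described above:
# a tie ("neither image is significantly better") credits half a win to both models.
def win_rate(votes):
    """votes: list of "A", "B", or "tie" from pairwise comparisons."""
    ties = votes.count("tie")
    wins_a = votes.count("A") + 0.5 * ties
    wins_b = votes.count("B") + 0.5 * ties
    n = len(votes)
    return wins_a / n, wins_b / n

print(win_rate(["A", "A", "tie", "B", "tie"]))  # -> (0.6, 0.4)
```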
“…edits images using text prompts by fine-tuning a diffusion model to target a CLIP loss while reconstructing the original image's DDIM latent. Zhou et al. (2021) trains GAN models conditioned on perturbed CLIP image embeddings, resulting in a model which can condition images on CLIP text embeddings.…”
mentioning
confidence: 99%
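The conditioning scheme described in this quote can be sketched as follows; the unit-Gaussian perturbation and noise scale shown are an assumed simplification of LAFITE's pseudo text-feature construction (see the paper for the exact schemes), and the generator in the comments is a placeholder:

```python
# Sketch of language-free conditioning in the LAFITE spirit: during training the
# generator is conditioned on a *perturbed* CLIP image embedding (no captions used);
# at inference a CLIP text embedding is dropped into the same conditioning slot.
# The perturbation below is one plausible variant, not the paper's exact formula.
import torch

def pseudo_text_feature(img_emb: torch.Tensor, xi: float = 0.1) -> torch.Tensor:
    """img_emb: (batch, dim) CLIP image embeddings -> perturbed conditioning features."""
    h = img_emb / img_emb.norm(dim=-1, keepdim=True)   # unit-norm image feature
    eps = torch.randn_like(h)
    eps = eps / eps.norm(dim=-1, keepdim=True)         # random unit direction
    return h + xi * eps                                # small fixed-level perturbation

# Training step (G is a placeholder conditional generator, e.g. StyleGAN2-based):
#   fake = G(z, pseudo_text_feature(clip_image_embed(real_images)))
# Inference (no text was ever seen during training):
#   image = G(z, clip_text_embed(["a red bird on a branch"]))
```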
“…We see that CM3 is capable of generating non-trivial semantically coherent captions. That being said, most failure cases of our proposed zero-shot captioning are due …

Model                            FID     Zero-shot FID
AttnGAN (Xu et al., 2017)        35.49
DM-GAN (Zhu et al., 2019)        32.64
DF-GAN (Tao et al., 2020)        21.42
DM-GAN + CL (Ye et al., 2021)    20.79
XMC-GAN                           9.33
LAFITE (Zhou et al., 2021)        8.12
DALL-E                                   ∼ 28
LAFITE (Zhou et al., 2021)               26.94
GLIDE (Nichol et al., 2021)              12.24

…2021) we sample roughly 30k conditioned samples for our models, and compare against the entire validation set.…”
Section: Source Image (mentioning)
confidence: 97%
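The FID figures quoted in these statements reduce to the Fréchet distance between Gaussian fits of Inception features for the reference set and the generated samples; a self-contained sketch of that final step (Inception-V3 feature extraction omitted):

```python
# Sketch of the FID computation behind the numbers quoted above:
# FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}),
# where (mu, C) are the mean and covariance of Inception features of the
# reference images and of the ~30k generated samples.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                              # drop numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```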
“…Text-to-image generation [64,72,54,65,67,45,12,41,70] focuses on generating images from standalone text descriptions. Preliminary text-to-image methods conditioned RNN-based DRAW [18] on text [40].…”
Section: Text-to-image Generation (mentioning)
confidence: 99%
“…Inspired by high-quality unconditional image generation models, GLIDE employed guided inference with and without a classifier network to generate high-fidelity images. LAFITE [70] employed a pre-trained CLIP [44] model to project text and images to the same latent space, training text-to-image models without text data. Similarly to DALL-E and CogView, we train an autoregressive transformer model on text and image tokens.…”
Section: Text-to-image Generation (mentioning)
confidence: 99%
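The "guided inference with and without a classifier network" attributed to GLIDE corresponds to classifier guidance and classifier-free guidance; a schematic sketch of the classifier-free combination step (the denoiser eps_model and its call signature are placeholders, not GLIDE's actual API):

```python
# Schematic classifier-free guidance step, the text-conditioning trick used when
# no separate classifier network is involved. `eps_model` is a placeholder denoiser;
# `empty_cond` stands for the null/empty caption embedding.
def guided_eps(eps_model, x_t, t, text_cond, empty_cond, guidance_scale=3.0):
    eps_cond = eps_model(x_t, t, text_cond)     # noise prediction with the caption
    eps_uncond = eps_model(x_t, t, empty_cond)  # noise prediction without it
    # Push the prediction further in the direction implied by the caption.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```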