2022
DOI: 10.48550/arxiv.2204.02849
Preprint

KNN-Diffusion: Image Generation via Large-Scale Retrieval

Abstract: While the availability of massive Text-Image datasets is shown to be extremely useful in training large-scale generative models (e.g. DDPMs, Transformers), their output typically depends on the quality of both the input text, as well as the training dataset. In this work, we show how large-scale retrieval methods, in particular efficient K-Nearest-Neighbors (KNN) search, can be used in order to train a model to adapt to new samples. Learning to adapt enables several new capabilities. Sifting through billions of…
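The retrieval step the abstract sketches, indexing image embeddings and querying them with a text embedding, can be illustrated concretely. What follows is a minimal sketch, not the authors' implementation: it assumes L2-normalized CLIP-style embeddings (random vectors stand in here), uses FAISS for the KNN index, and the helper name `retrieve_neighbors` is invented for illustration.

```python
import numpy as np
import faiss  # library for efficient K-Nearest-Neighbors search

d, N = 512, 10_000  # embedding dimension and database size (illustrative)

# Stand-in for the image-embedding database; in practice these would be
# CLIP image embeddings sharing a space with the text encoder.
rng = np.random.default_rng(0)
image_vecs = rng.standard_normal((N, d)).astype(np.float32)
image_vecs /= np.linalg.norm(image_vecs, axis=1, keepdims=True)

# Inner product on L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(d)
index.add(image_vecs)

def retrieve_neighbors(text_vec: np.ndarray, k: int = 10) -> np.ndarray:
    """Return the k database embeddings nearest to a text embedding;
    a retrieval-augmented generator would receive these as conditioning."""
    q = text_vec.astype(np.float32).reshape(1, d)
    q /= np.linalg.norm(q)
    _, ids = index.search(q, k)
    return image_vecs[ids[0]]  # shape (k, d)
```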

Cited by 5 publications (10 citation statements)
References 31 publications
“…Other works like Cogview [8], and M6 [25] have also achieved very promising results based on autoregressive models. [17,27,18,7,1,37], video generation [20], and audio generation [23,31]. It was first proposed in [17], and the subsequent work [27] proposes a reparameterization to stabilize the training.…”
Section: Related Work | mentioning | confidence: 99%
“…Concurrent Work Very recently, two concurrent approaches related to our work, unCLIP [42] and kNN-Diffusion [1], have been proposed. unCLIP produces very high quality text-image results by conditioning a diffusion model on the image representation of CLIP [40] and employing large-scale computation.…”
Section: Related Work | mentioning | confidence: 99%
“…When building the set $\mathcal{M}_{\mathcal{D}}^{(k)}(c_{\text{text}})$ by directly using the CLIP encodings $\phi_{\text{CLIP}}(c_{\text{text}})$ of the actual textual description itself (top row), we interestingly see that our model generalizes to generating fictional descriptions and transferring attributions across object classes. However, when using $\phi_{\text{CLIP}}(c_{\text{text}})$ together with its $k-1$ nearest neighbors from the database $\mathcal{D}$ as done in [1], the model does not generalize to these difficult conditional inputs (mid row). When omitting the text representation and only using the $k$ CLIP image representations of the nearest neighbors, the results get even worse (bottom row).…”
Section: Conditional Image Generation Without Conditional Training | mentioning | confidence: 99%
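The three conditioning variants that excerpt compares map directly onto code. A minimal sketch, continuing the retrieval example above: it reuses the illustrative `retrieve_neighbors` helper, and the mode names are invented here, not taken from either paper.

```python
import numpy as np

def conditioning_set(text_vec: np.ndarray, k: int, mode: str) -> np.ndarray:
    """Build the conditioning set M_D^(k)(c_text) under the three
    variants compared in the excerpt above."""
    if mode == "text_only":
        # top row: the text embedding phi_CLIP(c_text) alone
        return text_vec.reshape(1, -1)
    neighbors = retrieve_neighbors(text_vec, k)  # helper from the sketch above
    if mode == "text_plus_neighbors":
        # mid row: text embedding plus its k-1 nearest image neighbors,
        # the scheme the excerpt attributes to kNN-Diffusion [1]
        return np.vstack([text_vec.reshape(1, -1), neighbors[: k - 1]])
    if mode == "neighbors_only":
        # bottom row: the k nearest image embeddings only
        return neighbors
    raise ValueError(f"unknown mode: {mode}")
```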
“…However, these models are very compute intensive and so far cannot be reused for tasks other than those for which they were trained. For this reason, in the present work we build on the recently introduced retrieval-augmented diffusion models (RDMs) [3,2], which also have the potential to significantly reduce the computational complexity required in training by providing a comparatively small generative model with a large image database: While the retrieval approach provides the (local) content, the model can now focus on learning the composition of scenes based on this content. In this extended abstract, we scale RDMs and show their capability to generate artistic images as those shown in Fig.…”
mentioning | confidence: 99%
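The division of labor described here, retrieval supplying content while a comparatively small model learns composition, amounts to giving the denoiser the retrieved embeddings as an extra input during training. A purely illustrative sketch under strong simplifications: a toy linear denoiser and a crude noise schedule stand in for the real RDM architecture, and none of this is the authors' code.

```python
import torch
import torch.nn as nn

d, k = 512, 4

# Toy stand-in for the denoiser: predicts the noise from the (flattened)
# noisy sample, the pooled retrieval conditioning, and the timestep.
denoiser = nn.Linear(d + d + 1, d)

def training_step(x0: torch.Tensor, cond: torch.Tensor, t: float) -> torch.Tensor:
    """One DDPM-style training step with retrieved conditioning:
    noise x0, predict the noise given the conditioning, return MSE."""
    noise = torch.randn_like(x0)
    alpha = 1.0 - t                    # crude stand-in for a noise schedule
    x_t = alpha**0.5 * x0 + (1.0 - alpha)**0.5 * noise
    pooled = cond.mean(dim=0)          # pool the k retrieved embeddings
    inp = torch.cat([x_t, pooled, torch.tensor([t])])
    pred = denoiser(inp)
    return ((pred - noise) ** 2).mean()

# e.g.: loss = training_step(torch.randn(d), torch.randn(k, d), t=0.5)
```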