2022
DOI: 10.48550/arxiv.2205.16007
Preprint
Improved Vector Quantized Diffusion Models

Abstract: Vector quantized diffusion (VQ-Diffusion) is a powerful generative model for text-to-image synthesis, but sometimes can still generate low-quality samples or images weakly correlated with the text input. We find these issues are mainly due to the flawed sampling strategy. In this paper, we propose two important techniques to further improve the sample quality of VQ-Diffusion. 1) We explore classifier-free guidance sampling for discrete denoising diffusion models and propose a more general and effective implementation…

Cited by 12 publications (14 citation statements)
References 36 publications
“…These KNNs form a context key/value store for a standard cross-attention layer [21], where the queries are the incoming audio frame embeddings. effective for QA [24,25], image captioning [26] and other tasks [27,28,29,30,31,32].…”
Section: Related Work
confidence: 99%
“…They train a conditional model both conditionally and unconditionally by randomly substituting a class label with the null class label during the training phase. Due to its simplicity of implementation and effectiveness, this method has been widely used in high-quality diffusion models (Ramesh et al. 2022; Rombach et al. 2022; Tang et al. 2022; Wang et al. 2022; …).…”
Section: Related Work
confidence: 99%
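The excerpt above describes the classifier-free guidance training trick: with some probability, the conditioning signal is replaced by a learned null condition, so a single network is trained both conditionally and unconditionally. A minimal sketch in Python — the function name `cfg_dropout` and the string conditions are illustrative placeholders, not from any of the cited papers:

```python
import random

def cfg_dropout(cond, null_cond, p_uncond=0.1, rng=random):
    """With probability p_uncond, substitute the null condition.

    Training a denoiser on the result makes one model serve as both a
    conditional and an unconditional predictor, which is exactly what
    classifier-free guidance needs at sampling time.
    """
    return null_cond if rng.random() < p_uncond else cond

# p_uncond=0.0 always keeps the condition; p_uncond=1.0 always drops it.
```

In practice `cond` would be a class label or text embedding and `null_cond` a learned embedding, but the control flow is this simple.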
“…DINO (Caron et al. 2021) showed that the self-attention maps of self-supervised transformers have an object-oriented property, and are effective for object-oriented tasks such as semantic segmentation and video object segmentation. Specifically, several works (Van Gansbeke, Vandenhende, and Van Gool 2022; Zadaianchuk et al. 2022) use the attention…”
Section: Introduction
confidence: 99%
“…Within this line of research, GLIDE [169] compares CLIP guidance and classifier-free guidance in diffusion models for text-guided image synthesis, and concludes that classifier-free guidance yields better performance and that a diffusion model of 3.5 billion parameters outperforms DALL-E in terms of human evaluations. Besides, Tang et al. [170] explore classifier-free guidance sampling for discrete denoising diffusion models with the introduction of an effective implementation of classifier-free guidance.…”
Section: Conditional Diffusion Models
confidence: 99%
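At sampling time, classifier-free guidance combines the conditional and unconditional predictions as guided = uncond + s · (cond − uncond). A generic sketch of that combination — hedged: papers differ on whether it is applied to logits or log-probabilities, and `cfg_combine` is an illustrative name, not the exact VQ-Diffusion recipe:

```python
def cfg_combine(cond_logits, uncond_logits, scale=2.0):
    """Classifier-free guidance over a vector of categorical logits.

    guided = uncond + scale * (cond - uncond); scale=1 recovers the
    plain conditional model, while scale>1 pushes samples toward the
    condition at some cost in diversity.
    """
    return [u + scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]
```

For a discrete diffusion model, the guided logits would then be normalized (e.g. with a softmax) before sampling the next denoising step.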
“…Similarly, Gu et al. [171] present a vector quantized diffusion (VQ-Diffusion) model for text-to-image generation by learning a parametric model using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM). Tang et al. [170] further improve VQ-Diffusion by introducing a high-quality inference strategy to alleviate the joint distribution issue. Following VQ-Diffusion, Text2Human [178] is introduced to achieve high-quality text-driven human generation by employing a diffusion-based Transformer to model a hierarchical discrete latent space.…”
Section: Conditional Diffusion Models
confidence: 99%