2021
DOI: 10.48550/arxiv.2110.04627
Preprint

Vector-quantized Image Modeling with Improved VQGAN

Abstract: Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer-learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first pr…
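The abstract's two-stage recipe can be made concrete with a short sketch. The PyTorch code below is a minimal, hypothetical illustration (all module names, sizes and the toy tokenizer are assumptions, not the paper's actual ViT-VQGAN): stage 1 maps image patches to discrete codebook indices; stage 2 trains a causal Transformer on the rasterized token sequence with ordinary next-token cross-entropy, exactly as in language modeling.

```python
# Toy sketch of Vector-quantized Image Modeling (hypothetical shapes/names,
# not the paper's actual architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQTokenizer(nn.Module):
    """Stage 1: encode image patches and snap them to a discrete codebook."""
    def __init__(self, patch_dim=48, num_codes=256, code_dim=32):
        super().__init__()
        self.encoder = nn.Linear(patch_dim, code_dim)  # stand-in for a real encoder
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, patches):                         # patches: (B, L, patch_dim)
        z = self.encoder(patches)                       # (B, L, code_dim)
        # Nearest-neighbor lookup in the codebook -> discrete token ids.
        dists = torch.cdist(z, self.codebook.weight)    # (B, L, num_codes)
        return dists.argmin(dim=-1)                     # (B, L) integer tokens

class ToyARTransformer(nn.Module):
    """Stage 2: predict the rasterized image tokens autoregressively."""
    def __init__(self, num_codes=256, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(num_codes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_codes)

    def forward(self, tokens):                          # tokens: (B, L)
        L = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(L)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                             # (B, L, num_codes) logits

# Next-token prediction on image tokens, mirroring language-model pretraining.
patches = torch.randn(2, 16, 48)                        # 2 images, 16 patches each
tokens = ToyVQTokenizer()(patches)                      # discretize (stage 1)
model = ToyARTransformer()
logits = model(tokens[:, :-1])                          # predict token t+1 from <=t
loss = F.cross_entropy(logits.reshape(-1, 256), tokens[:, 1:].reshape(-1))
```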

Cited by 32 publications (60 citation statements) · References 31 publications

“…VQGAN [15] adds adversarial loss and perceptual loss [26,54] in the first stage to improve the image fidelity. A contemporary work to ours, VIM [51], proposes to use a ViT backbone [13] to further improve the tokenization stage. Since these approaches still employ an auto-regressive model, the decoding time in the second stage scales with the token sequence length.…”
Section: Image Synthesis
Confidence: 99%

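The scaling point in the excerpt above is easy to see in code: sampling a raster of L image tokens with an autoregressive model takes L sequential forward passes. Below is a self-contained toy sketch (the pooling "model" is a hypothetical stand-in for a causal Transformer, not any cited system):

```python
# Why autoregressive decoding scales with token count: generating L image
# tokens requires L sequential model calls (toy stand-in, illustrative only).
import torch
import torch.nn as nn

num_codes, d_model = 256, 64
embed = nn.Embedding(num_codes, d_model)
head = nn.Linear(d_model, num_codes)

def next_token_logits(prefix):             # stand-in for a causal Transformer
    return head(embed(prefix).mean(dim=1))

tokens = torch.zeros(1, 1, dtype=torch.long)  # start token
for _ in range(255):                       # one forward pass per generated token,
    logits = next_token_logits(tokens)     # so a 16x16 token grid needs 256 steps
    nxt = torch.multinomial(logits.softmax(-1), 1)
    tokens = torch.cat([tokens, nxt], dim=1)
```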
“…In this paper, we directly use the same image tokenizer as BEiT for a fair and clear comparison with recent works. Meanwhile, we believe other image tokenizers, such as (Esser et al., 2021; Dong et al., 2021; Yu et al., 2021), deserve an in-depth study for CIM pre-training in the future.…”
Section: Sampling Strategy
Confidence: 94%

“…In this paper, we focus more on the architectural flexibility and universality of CIM, while the scaling behavior is not fully explored. The image tokenizer we use is essentially a large CNN and adds nontrivial overhead during pre-training; we believe this can be largely resolved by using a more advanced tokenizer, such as ViT-VQGAN (Yu et al., 2021), which reports much higher throughput and better generation quality. Moreover, the influence of corrupted images' characteristics, styles and distributions on the pre-trained representation quality still needs more investigation.…”
Section: Limitations and Future Research
Confidence: 99%

“…In the above approaches, a convolutional neural network (CNN) is learned to quantize and generate images. Instead, Yu et al. [144] propose ViT-VQGAN, which replaces the CNN encoder and decoder with a Vision Transformer (ViT) [145]. Given sufficient data (and unlabeled image data is plentiful), ViT-VQGAN is shown to be less constrained by the inductive priors imposed by convolutions.…”
Section: Discrete Vector Representation
Confidence: 99%
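As a rough sketch of the tokenizer design this excerpt describes (a ViT in place of the CNN encoder), the hypothetical snippet below patchifies an image, runs a small Transformer encoder over the patches, and snaps each patch feature to its nearest codebook entry; all sizes and names are illustrative, not ViT-VQGAN's actual configuration.

```python
# ViT-style tokenizer idea: non-overlapping patches -> Transformer encoder ->
# per-patch features, vector-quantized to discrete codes (illustrative sizes).
import torch
import torch.nn as nn

patch, d_model = 8, 64
to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)  # patchify
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
vit_encoder = nn.TransformerEncoder(layer, num_layers=2)
codebook = nn.Embedding(512, d_model)

x = torch.randn(1, 3, 64, 64)                     # one 64x64 RGB image
feats = to_patches(x).flatten(2).transpose(1, 2)  # (1, 64 patches, d_model)
feats = vit_encoder(feats)                        # global attention across patches
token_ids = torch.cdist(feats, codebook.weight).argmin(-1)  # (1, 64) codes
```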