Vector-quantized Image Modeling with Improved VQGAN

Yu, Jiahui; Li, Xin; Koh, Jing Yu; Zhang, Han; Pang, Ruoming; Qin, James; Ku, Alexander; Xu, Yuanzhong; Baldridge, Jason; Wu, Yonghui

doi:10.48550/arxiv.2110.04627

Cited by 32 publications (60 citation statements)

References 31 publications


“…VQGAN [15] adds adversarial loss and perceptual loss [26,54] in the first stage to improve the image fidelity. A contemporary work to ours, VIM [51], proposes to use a VIT backbone [13] to further improve the tokenization stage. Since these approaches still employ an auto-regressive model, the decoding time in the second stage scales with the token sequence length.…”

Section: Image Synthesis (mentioning)

confidence: 99%

MaskGIT: Masked Generative Image Transformer

Chang, Zhang, J, et al. 2022. Preprint. Self-cite.

Figure 1. Example generation by MaskGIT on image synthesis and manipulation tasks. We show that MaskGIT is a flexible model that can generate high-quality samples on (a) class-conditional synthesis, (b) class-conditional image manipulation, e.g. replacing selected objects in the bounding box with ones from the given classes, and (c) image extrapolation. Examples shown here have resolutions 512×512, 512×512, and 512×2560 in the three columns, respectively. Zoom in to see the details.
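The MaskGIT citation statement above notes that, with an autoregressive second stage, decoding time scales with the token sequence length. The following is a minimal sketch of that arithmetic only, not code from any of the cited papers. It assumes one discrete token per square patch and one sequential decoding step per token; the function names and the patch size of 16 are hypothetical choices made for illustration.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Count tokens when each patch x patch pixel block maps to one token."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def autoregressive_steps(height: int, width: int, patch: int) -> int:
    """Assume one sequential step per token, so steps equal sequence length."""
    return num_tokens(height, width, patch)


if __name__ == "__main__":
    # Patch size 16 is an illustrative assumption, not a value from the source.
    for size in (256, 512):
        print(size, autoregressive_steps(size, size, patch=16))
```

Under these assumptions, doubling the side length of a square image quadruples the number of sequential decoding steps, which is the scaling the statement refers to.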


“…In this paper, we directly use the same image tokenizer as BEiT for a fair and clear comparison with recent works. Meanwhile, we believe other image tokenizers, such as (Esser et al., 2021; Dong et al., 2021; Yu et al., 2021), deserve an in-depth study for CIM pre-training in the future.…”

Section: Sampling Strategy (mentioning)

confidence: 94%

“…In this paper, we focus more on the architectural flexibility and universality of CIM, while the scaling behavior is not fully explored. The image tokenizer we use is essentially a large CNN and adds nontrivial overhead during pre-training, we believe that it can be largely resolved by using a more advanced tokenizer, such as ViT-VQGAN (Yu et al, 2021), which reports much higher throughput and better generation quality. Moreover, the influence of corrupted images' characteristics, styles and distributions on the pre-trained representation quality still needs more investigation.…”

Section: Limitations and Future Research (mentioning)

confidence: 99%

“…For the first time, we demonstrate that both ViT and CNN can learn rich visual representations using a unified non-Siamese structure. Moreover, the components of CIM, such as the generator, the image tokenizer (Esser et al., 2021; Yu et al., 2021; Dong et al., 2021), the sampling method (Holtzman et al., 2019), as well as the pre-training objective (Wei et al., 2021) can be further customized and improved.…”

Section: Introduction (mentioning)

confidence: 99%


Corrupted Image Modeling for Self-Supervised Visual Pre-Training

Fang, Liu, Bao, et al. 2022. Preprint.

We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens, where some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns to either recover all the original image pixels, or predict whether each visual token is replaced by a generator sample or not. The generator and the enhancer are simultaneously trained and synergistically updated. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework that is suitable for various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework. Experimental results show that our approach achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation. For example, 300-epoch CIM pretrained vanilla ViT-Base/16 and ResNet-50 obtain 83.3 and 80.6 Top-1 fine-tuning accuracy on ImageNet-1K image classification respectively. * Contribution during internship at Microsoft.


“…In above approaches, a convolution neural network (CNN) is learned to quantize and generate images. Instead, Yu et al [144] propose ViT-VQGAN which replaces the CNN encoder and decoder with Vision Transformer (ViT) [145]. Given sufficient data (for which unlabeled image data is plentiful), ViT-VQGAN is shown to be less constrained by the inductive priors imposed by convolutions.…”

Section: Discrete Vector Representation (mentioning)

confidence: 99%

Multimodal Image Synthesis and Editing: A Survey

Zhan, Yu, Wu, et al. 2021. Preprint.

As information exists in various modalities in real world, effective interaction and fusion among multimodal information plays a key role for the creation and perception of multimodal data in computer vision and deep learning research. With superb power in modelling the interaction among multimodal information, multimodal image synthesis and editing have become a hot research topic in recent years. Different from traditional visual guidance which provides explicit clues, multimodal guidance offers intuitive and flexible means in image synthesis and editing. On the other hand, this field is also facing several challenges in alignment of features with inherent modality gaps, synthesis of high-resolution images, faithful evaluation metrics, etc. In this survey, we comprehensively contextualize the advance of the recent multimodal image synthesis & editing and formulate taxonomies according to data modality and model architectures. We start with an introduction to different types of guidance modalities in image synthesis and editing. We then describe multimodal image synthesis and editing approaches extensively with detailed frameworks including Generative Adversarial Networks (GANs), GAN Inversion, Transformers, and other methods such as NeRF and Diffusion models. This is followed by a comprehensive description of benchmark datasets and corresponding evaluation metrics as widely adopted in multimodal image synthesis and editing, as well as detailed comparisons of different synthesis methods with analysis of respective advantages and limitations. Finally, we provide insights into the current research challenges and possible future research directions. We hope this survey could lay a sound and valuable foundation for future development of multimodal image synthesis and editing. A project associated with this survey is available at https://github.com/fnzhan/MISE.
