2022
DOI: 10.48550/arxiv.2202.04200
Preprint

MaskGIT: Masked Generative Image Transformer

Abstract: Figure 1. Example generation by MaskGIT on image synthesis and manipulation tasks. We show that MaskGIT is a flexible model that can generate high-quality samples on (a) class-conditional synthesis, (b) class-conditional image manipulation, e.g. replacing selected objects in the bounding box with ones from the given classes, and (c) image extrapolation. Examples shown here have resolutions 512×512, 512×512, and 512×2560 in the three columns, respectively. Zoom in to see the details.

Cited by 12 publications (24 citation statements). References 18 publications.
“…A bidirectional transformer for autoregressive generation was recently proposed by MaskGIT [4]. Here we show that our model can also be combined with a bidirectional transformer.…”
Section: Bidirectional Transformer
confidence: 87%
“…Tokens are sampled based on their probabilities (transformer output). See [4] for a detailed explanation. We show a visualization of the decoding steps in Fig.…”
Section: Bidirectional Transformer
confidence: 99%
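The quote above refers to MaskGIT's iterative scheme, in which all masked token positions are predicted in parallel, the most confident predictions are kept, and the rest are re-masked on a shrinking schedule. As a rough illustration only (not the authors' implementation), a minimal NumPy sketch could look like the following; the `predict_probs` callable, the greedy per-position pick, and the cosine masking schedule are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_schedule(t):
    # Fraction of positions still masked after step fraction t in [0, 1].
    return float(np.cos(0.5 * np.pi * t))

def parallel_decode(predict_probs, seq_len=16, steps=4):
    """Confidence-based parallel decoding (illustrative sketch).

    predict_probs(tokens) -> (seq_len, vocab) array of per-position
    token probabilities; masked positions in `tokens` are marked -1.
    """
    MASK = -1
    tokens = np.full(seq_len, MASK)
    for step in range(1, steps + 1):
        probs = predict_probs(tokens)        # predict all positions at once
        sampled = probs.argmax(axis=-1)      # greedy pick per position
        conf = probs.max(axis=-1)
        fixed = tokens != MASK
        sampled[fixed] = tokens[fixed]       # already-decoded tokens stay
        conf[fixed] = np.inf                 # never re-mask fixed tokens
        n_mask = int(np.floor(cosine_schedule(step / steps) * seq_len))
        order = np.argsort(conf)             # least-confident first
        tokens = sampled
        tokens[order[:n_mask]] = MASK        # re-mask low-confidence picks
    return tokens
```

Because the schedule reaches zero at the final step, every position ends up with a committed token after `steps` iterations, rather than one token per step as in purely autoregressive decoding.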
“…The speed of generation was not an issue in the era when GANs dominated image generation, but it constitutes a considerable challenge for current auto-regressive text-to-image models. M6-UFC [33] first introduces NAR methods into the VQ-VAE framework, and similar ideas are adopted by VQ-diffusion [11] and MaskGIT [1]. A possible drawback of pure NAR methods is that tokens sampled at the same time might lead to global inconsistency in later steps during the generation of complex scenes.…”
Section: Related Work
confidence: 99%
“…The goal of MJP is to enhance position insensitivity, which directly increases the difficulty of recovering images from gradient updates, while preserving accuracy on standard classification during pre-training. In practice, image masking is commonly applied in recent vision tasks [2,57,20,58,7] as a useful, off-the-shelf strategy for self-supervised image reconstruction. Given an input image x ∈ ℝ^(H×W×C), we reshape it into a sequence of flattened 2D patches x_p ∈ ℝ^(N×(P²·C)), where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches.…”
Section: Masked Jigsaw Puzzle
confidence: 99%
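The patch reshaping described in the quote — an (H, W, C) image flattened into N = HW/P² patches of dimension P²·C — can be sketched generically in NumPy. This is an illustration of the formula, not the cited paper's code:

```python
import numpy as np

def patchify(x, P):
    """Reshape an image of shape (H, W, C) into N = H*W / P**2
    flattened patches, giving an array of shape (N, P*P*C),
    i.e. x_p in R^(N x (P^2 * C))."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "image dims must be divisible by P"
    x = x.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)   # -> (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)  # -> (N, P^2 * C)
```

For example, a 4×4×3 image with P = 2 yields N = 4 patches, each a flat vector of length 12.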