…Unlike the minimax game played in popular GAN models [13,1,20,21], the VQ-based generator is trained by minimizing the negative log-likelihood over all examples in the training set, which leads to stable training and sidesteps the "mode collapse" issue. Motivated by these advantages, many image synthesis models follow the two-stage paradigm, including image generation [31,45,2,24,16], image-to-image translation [11,10,32], text-to-image synthesis [30,29,10,7], conditional video generation [28,42,44], and image completion [11,10,47]. Apart from VQGAN, the works most closely related to ours are ViT-VQGAN [45] and RQ-VAE [24], which aim to learn a better quantizer in the first stage.…
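
To make the two-stage contrast concrete, below is a minimal PyTorch sketch; the class name `VectorQuantizer`, the helper `prior_nll`, and all toy dimensions and loss weights are illustrative assumptions, not taken from any of the cited papers. The first stage snaps continuous encoder features to their nearest codebook entries with a straight-through gradient; the second stage fits a prior over the resulting discrete code indices by plain negative log-likelihood (cross-entropy), with no discriminator and hence no minimax game.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Stage 1: nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta  # commitment-loss weight (illustrative value)

    def forward(self, z):  # z: (B, N, dim) continuous encoder features
        w = self.codebook.weight  # (num_codes, dim)
        # Squared Euclidean distance from every feature to every codebook entry.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ w.t()
             + w.pow(2).sum(-1))                  # (B, N, num_codes)
        idx = d.argmin(-1)                        # discrete code indices, (B, N)
        z_q = self.codebook(idx)                  # quantized features, (B, N, dim)
        # Codebook and commitment losses, as in VQ-VAE-style quantizers.
        vq_loss = (F.mse_loss(z_q, z.detach())
                   + self.beta * F.mse_loss(z, z_q.detach()))
        # Straight-through estimator: gradients flow from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss

def prior_nll(logits, idx):
    """Stage 2: negative log-likelihood of code indices under the prior.

    logits: (B, N, num_codes), e.g. from an autoregressive transformer;
    idx:    (B, N) target code indices produced by the stage-1 quantizer.
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), idx.reshape(-1))
```

Because the stage-2 objective is a per-token cross-entropy averaged over the whole training set, every example contributes to the loss directly; this is the sense in which the likelihood objective avoids the mode-dropping behaviour of an adversarial game.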