2022
DOI: 10.1007/978-3-031-19787-1_41
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Cited by 109 publications (41 citation statements)
References 18 publications
“…Precursors to text-conditioned audio synthesis are the text-conditioned image generation models, which made significant progress in quality due to architectural improvements and the availability of massive, high-quality paired training data (Wu et al., 2022a; Hong et al., 2022; Villegas et al., 2022; Ho et al., 2022).…”
Section: Text-conditioned Image Generation
Mentioning; confidence: 99%
“…According to the two time-scale update rule (TTUR) [12], the learning rate is set to 0.0001 for the generator and 0.0004 for the discriminator. Following previous text-to-image works [42,47,48,57], we adopt the Fréchet Inception Distance (FID) [12] and CLIPSIM [47] to evaluate image fidelity and text-image semantic consistency. All GALIP models are trained on 8×3090 GPUs.…”
Section: Methods
Mentioning; confidence: 99%
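The asymmetric learning rates quoted above (0.0001 for the generator, 0.0004 for the discriminator) are the core of TTUR. A minimal sketch of the idea follows, using plain-Python gradient steps on scalar parameters; the scalar parameters and `sgd_step` helper are illustrative stand-ins, not GALIP's actual networks or optimizer.

```python
# Two time-scale update rule (TTUR), simplified: the discriminator is
# updated on a faster time scale (larger learning rate) than the generator.
LR_GENERATOR = 1e-4      # slower time scale, per the quoted setup
LR_DISCRIMINATOR = 4e-4  # faster time scale, per the quoted setup

def sgd_step(param: float, grad: float, lr: float) -> float:
    """One vanilla gradient-descent update: param <- param - lr * grad."""
    return param - lr * grad

# Identical gradients, but the discriminator moves four times as far.
g_param = sgd_step(0.5, grad=1.0, lr=LR_GENERATOR)
d_param = sgd_step(0.5, grad=1.0, lr=LR_DISCRIMINATOR)
```

In a real GAN training loop this amounts to constructing two optimizers with different `lr` values, one over the generator's parameters and one over the discriminator's.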
“…Specifically, DALL-E (Ramesh et al., 2021) demonstrates that training a large-scale auto-regressive Transformer on numerous image-text pairs can result in a high-fidelity generative model whose synthesis results are controllable through text prompts. NUWA (Wu et al., 2022b) presents a unified multimodal pretrained model that can generate or manipulate visual data (i.e., images and videos) with a 3D transformer encoder-decoder framework and a 3D Nearby Attention (3DNA) mechanism. In NUWA-Infinity (Wu et al., 2022a), the authors further propose an autoregressive-over-autoregressive generation method for high-resolution infinite visual synthesis.…”
Section: VQ-token-based Auto-regressive Methods
Mentioning; confidence: 99%
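The 3DNA mechanism mentioned above restricts each token's attention to a local neighborhood instead of the full sequence. A hedged sketch of that locality idea follows, reduced from 3D to 1D for clarity; the function name and window parameter are illustrative, and NUWA's actual 3DNA applies the same principle jointly over the temporal and spatial axes of video tokens.

```python
# Simplified "nearby attention" mask in 1D: token i may attend to token j
# only when j lies within a fixed window around i. NUWA's 3DNA extends this
# locality to the (time, height, width) axes of a 3D token grid.
def nearby_attention_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

mask = nearby_attention_mask(seq_len=5, window=1)
# With window=1, token 2 attends only to tokens 1, 2, and 3.
```

Compared with full self-attention, such a mask cuts the cost per token from O(n) to O(window), which is what makes attention over long 3D video-token sequences tractable.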