2022
DOI: 10.48550/arxiv.2204.14217
Preprint

CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

Abstract: The development of transformer-based text-to-image models is impeded by their slow generation and complexity for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel autoregressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compa…
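For orientation, the CogLM pretraining task named in the abstract amounts to infilling masked token regions with a single transformer, which can be expressed as an attention-mask pattern. The sketch below is a minimal illustration under that reading (plain NumPy, my own construction rather than code from the CogView2 paper or its repository), assuming context tokens are mutually visible while tokens inside the masked region are generated left-to-right.

```python
import numpy as np

def coglm_attention_mask(num_tokens: int, mask_region: range) -> np.ndarray:
    """Boolean matrix where mask[i, j] == True means position i may attend to j.

    Tokens outside `mask_region` act as bidirectional context; tokens inside
    the region are generated left-to-right, seeing every context token plus
    the region tokens up to and including themselves.
    """
    in_region = np.zeros(num_tokens, dtype=bool)
    in_region[list(mask_region)] = True

    mask = np.zeros((num_tokens, num_tokens), dtype=bool)
    for i in range(num_tokens):
        mask[i] = ~in_region                          # full view of the context
        if in_region[i]:
            mask[i, : i + 1] |= in_region[: i + 1]    # causal view inside the region
    return mask

# Example: a 12-token sequence in which positions 7-10 form the masked region
# that the model learns to infill.
print(coglm_attention_mask(12, range(7, 11)).astype(int))
```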

Cited by 26 publications (38 citation statements)
References 24 publications
“…In the text-to-image generation, pretrained autoregressive transformers such as DALL-E [18] and CogView [5] have shown superiority in open-domain image generation. Besides the pure GPT-style generation, CogView2 [6] proposes a new language model CogLM for infilling in the image generation.…”
Section: Autoregressive Transformer (mentioning)
confidence: 99%
“…We train another frame interpolation model to insert transition frames to the generated samples of the sequential generation model. Thanks to the generality of CogLM [6], the two models can share the same structure and training process only with different attention masks.…”
Section: Interpolate Frames (mentioning)
confidence: 99%
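The statement above (from a work citing CogView2) turns on the point that one transformer structure can serve both sequential generation and frame interpolation once only the attention mask changes. Below is a minimal NumPy illustration of that reuse; the masks are toy examples made up for this sketch, not the cited models' actual masks.

```python
import numpy as np

def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head self-attention with untrained (identity) projections.
    mask[i, j] == True means position i may attend to position j; the model
    code is identical for every generation order -- only `mask` changes."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -1e9)           # hide disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.normal(size=(n, d))

# Sequential generation: ordinary causal (lower-triangular) mask.
causal = np.tril(np.ones((n, n), dtype=bool))

# Frame interpolation: even positions stand for tokens of given key frames
# (bidirectional context), odd positions for the transition frames to infill,
# generated left-to-right among themselves.
known = np.arange(n) % 2 == 0
interp = np.zeros((n, n), dtype=bool)
for i in range(n):
    interp[i] = known
    if not known[i]:
        interp[i, : i + 1] |= ~known[: i + 1]

y_sequential = masked_self_attention(x, causal)
y_interpolate = masked_self_attention(x, interp)    # same code path, different mask
print(y_sequential.shape, y_interpolate.shape)
```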