CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

Ding, Ming; Zheng, Wendi; Hong, Wenyi; Tang, Jie

doi:10.48550/arxiv.2204.14217

Cited by 26 publications

(38 citation statements)

References 24 publications


“…In the text-to-image generation, pretrained autoregressive transformers such as DALL-E [18] and CogView [5] have shown superiority in open-domain image generation. Besides the pure GPT-style generation, CogView2 [6] proposes a new language model CogLM for infilling in the image generation.…”

Section: Autoregressive Transformer (mentioning)

confidence: 99%

“…We train another frame interpolation model to insert transition frames to the generated samples of the sequential generation model. Thanks to the generality of CogLM [6], the two models can share the same structure and training process only with different attention masks.…”

Section: Interpolate Frames (mentioning)

confidence: 99%

“…However, most previous works use GPT [34,36,35], which is unidirectional. To be aware of the bidirectional context, we adopt Cross-Modal General Language Model (CogLM) proposed in [6] which unites bidirectional context-aware mask prediction and autoregressive generation by dividing tokens into unidirectional and bidirectional attention regions. While bidirectional regions can attend to all bidirectional regions, unidirectional regions can attend to all bidirectional regions and previous unidirectional regions.…”

Section: Interpolate Frames (mentioning)

confidence: 99%
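
The attention rule quoted in the statement above can be sketched as a boolean mask. This is a minimal illustration and not the CogLM implementation: the function name `coglm_style_mask` and the `is_bidirectional` flag list are hypothetical, and letting a unidirectional token attend to itself is an assumption borrowed from standard causal masking, since the quote only says "previous unidirectional regions".

```python
def coglm_style_mask(is_bidirectional):
    """Return mask[i][j] == True when token i may attend to token j.

    is_bidirectional: list of bools, one per token position
    (hypothetical input format, chosen for illustration only).
    """
    n = len(is_bidirectional)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):          # i indexes the attending (query) token
        for j in range(n):      # j indexes the attended (key) token
            if is_bidirectional[j]:
                # Tokens in either region may attend to all bidirectional tokens.
                mask[i][j] = True
            elif not is_bidirectional[i]:
                # Unidirectional tokens also see earlier unidirectional tokens.
                # Assumption: a token may attend to itself (j == i).
                mask[i][j] = j <= i
            # Otherwise a bidirectional token is looking at a unidirectional
            # token, which the quoted rule does not allow: stays False.
    return mask


# Two bidirectional context tokens followed by two generated tokens.
for row in coglm_style_mask([True, True, False, False]):
    print(["x" if allowed else "." for allowed in row])
```

Under this reading, the two models mentioned in the citing paper would differ only in which positions are flagged as bidirectional, which is one way to interpret "only with different attention masks".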

“…Pretrained text-to-image models, e.g. CogView2 [6], already have a good command of the text-image relations. The coverage of the dataset to train these models is also larger than that of videos.…”

Section: Dual-channel Attention (mentioning)

confidence: 99%

“…Here we present a large-scale pretrained text-to-video generative model, CogVideo, which is of 9.4 billion parameters and trained on 5.4 million text-video pairs. We build CogVideo based on a pretrained text-to-image model, CogView2 [6], in order to inherit the knowledge learned from the text-image pretraining. To ensure the alignment between text and its temporal counterparts in the video, we propose the multi-frame-rate hierarchical training.…”

Section: Introduction (mentioning)

confidence: 99%


CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Hong, Ding, Zheng et al. 2022. Preprint.

Self Cite


Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model's understanding of complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.


Trace Controlled Text to Image Generation

Yan et al. 2022. Lecture Notes in Computer Science.


Figure 1. (a) ReCo extends pre-trained text-to-image models (Stable Diffusion [33]) with an extra set of input position tokens (in dark blue color) that represent quantized spatial coordinates. Combining position and text tokens yields the region-controlled text input, which can specify an open-ended regional description precisely for any image region. (b) With the region-controlled text input, ReCo can better control the object count/relationship/size properties and improve the T2I semantic correctness. We empirically observe that position tokens are less likely to get overlooked than positional text words, especially when the input query is complicated or describes an unusual scene.


DEC-205 receptor targeted poly(lactic-co-glycolic acid) nanoparticles containing Eucommia ulmoides polysaccharide enhances the immune response of foot-and-mouth disease vaccine in mice

Feng, Fan et al. 2023. International Journal of Biological Macromolecules.

