2022
DOI: 10.48550/arxiv.2205.15868
Preprint

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Abstract: Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView…
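The abstract's core idea, initializing a text-to-video transformer from a pretrained text-to-image model, can be sketched roughly as below. This is a minimal illustration under assumed names (DualChannelBlock, inherit_pretrained, the checkpoint key layout), not CogVideo's actual code: an inherited spatial attention channel is reused per frame, while a newly initialized temporal channel is meant to attend across frames.

```python
# Minimal sketch (assumed names, not CogVideo's code) of initializing a
# text-to-video transformer from pretrained text-to-image weights: spatial
# layers are inherited, temporal layers start from fresh initialization.
import torch
import torch.nn as nn

class DualChannelBlock(nn.Module):
    """One block with an inherited spatial path plus a new temporal path."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, tokens, dim) -- spatial attention within each frame.
        h = self.norm1(x)
        x = x + self.spatial_attn(h, h, h, need_weights=False)[0]
        # A real model would reshape to (batch * tokens, frames, dim) here so
        # the temporal path attends across frames; omitted for brevity.
        h = self.norm2(x)
        x = x + self.temporal_attn(h, h, h, need_weights=False)[0]
        return x

def inherit_pretrained(video_blocks, image_state_dict):
    """Copy matching spatial weights from a pretrained text-to-image checkpoint
    (key prefix below is an assumed layout); temporal weights stay fresh."""
    for i, block in enumerate(video_blocks):
        prefix = f"blocks.{i}.attn."  # assumed checkpoint key layout
        block.spatial_attn.load_state_dict(
            {k[len(prefix):]: v for k, v in image_state_dict.items()
             if k.startswith(prefix)},
            strict=False)
```

In the paper, a dual-channel design of this kind lets the inherited spatial parameters stay close to their pretrained values while the added temporal channel learns motion.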

Cited by 22 publications (44 citation statements) · References 28 publications
“…Precursors to text-conditioned audio synthesis are the text-conditioned image generation models, which made significant progress in quality due to architectural improvements and the availability of massive, high-quality paired training data (Wu et al., 2022a; Hong et al., 2022; Villegas et al., 2022; Ho et al., 2022).…”
Section: Text-conditioned Image Generation (mentioning, confidence: 99%)
“…For example, an editing system could describe the impact of an applied effect on the visual content in the video (e.g., "the vignette effect now covers the hands") using techniques from prior work in BLV visual design authoring [61] and computer vision approaches for captioning differences between pairs of similar images [32]. Recent strides in prompt-driven text generation [11], image generation [62,68], and image editing [52] suggest that prompt-driven video editing (e.g., make this clip moody) may be possible in the future [26]. Future research is needed to help BLV creators evaluate their results with such tools.…”
Section: Discussion and Future Work (mentioning, confidence: 99%)
“…Phenaki [38] introduces a bidirectional masked transformer with a causal attention mechanism that allows the generation of arbitrarily long videos from text prompt sequences. CogVideo [15] extends the text-to-image model CogView2 [4] by tuning it with a multi-frame-rate hierarchical training strategy to better align text and video clips. Video Diffusion Models (VDM) [14] naturally extend text-to-image diffusion models and train jointly on image and video data.…”
Section: Text-to-Video Generation (mentioning, confidence: 99%)
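The multi-frame-rate strategy mentioned in the last excerpt conditions generation on an explicit frame-rate token prepended to the text tokens, so one model can generate at several sampling rates. A minimal sketch of that conditioning follows; the class name, token layout, and frame-rate set are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of frame-rate conditioning: a discrete frame-rate token is
# prepended to the text tokens. Token ids and layout are assumed for illustration.
import torch
import torch.nn as nn

FRAME_RATES = [1, 2, 4, 8]  # frames per second the model trains on (assumed set)

class ConditionedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.rate_emb = nn.Embedding(len(FRAME_RATES), dim)  # one token per rate

    def forward(self, text_ids: torch.Tensor, frame_rate: int) -> torch.Tensor:
        # text_ids: (batch, text_len); the frame-rate token goes first.
        rate_idx = torch.full(
            (text_ids.size(0), 1), FRAME_RATES.index(frame_rate),
            dtype=torch.long, device=text_ids.device)
        rate = self.rate_emb(rate_idx)         # (batch, 1, dim)
        text = self.token_emb(text_ids)        # (batch, text_len, dim)
        return torch.cat([rate, text], dim=1)  # (batch, 1 + text_len, dim)

# Usage: embed a prompt for 4 fps generation, then feed the sequence to the transformer.
emb = ConditionedEmbedding(vocab_size=50_000, dim=256)
ids = torch.randint(0, 50_000, (2, 16))
seq = emb(ids, frame_rate=4)  # shape: (2, 17, 256)
```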