2021
DOI: 10.48550/arxiv.2112.02815
Preprint
Make It Move: Controllable Image-to-Video Generation with Text Descriptions

Abstract: Generating controllable videos conforming to user intentions is an appealing yet challenging topic in computer vision. To enable maneuverable control in line with user intentions, a novel video generation task, named Text-Image-to-Video generation (TI2V), is proposed. With both controllable appearance and motion, TI2V aims at generating videos from a static image and a text description. The key challenges of the TI2V task lie both in aligning appearance and motion from different modalities, and in handling uncertainty…
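The abstract defines TI2V as mapping a single static image plus a text description to a video clip. A minimal sketch of that interface follows, assuming a PyTorch-style model; every name here (TI2VModel, image_enc, text_enc, decoder) is an illustrative assumption, not the paper's actual architecture:

```python
# Hypothetical sketch of the TI2V task interface: one static image supplies
# appearance, one text description supplies motion, and the model emits frames.
# All layer choices below are toy assumptions, not the paper's method.
import torch
import torch.nn as nn

class TI2VModel(nn.Module):
    def __init__(self, num_frames: int = 16, dim: int = 64):
        super().__init__()
        self.num_frames = num_frames
        self.image_enc = nn.Conv2d(3, dim, kernel_size=3, padding=1)  # appearance
        self.text_enc = nn.Embedding(10_000, dim)                     # motion cue
        self.decoder = nn.Conv2d(dim, 3, kernel_size=3, padding=1)

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); text_ids: (B, L) token ids
        app = self.image_enc(image)                    # (B, D, H, W)
        motion = self.text_enc(text_ids).mean(dim=1)   # (B, D) pooled text feature
        frames = []
        for t in range(self.num_frames):
            # condition each frame on the appearance plus a text-derived offset
            feat = app + motion[:, :, None, None] * (t / self.num_frames)
            frames.append(self.decoder(feat))
        return torch.stack(frames, dim=1)              # (B, T, 3, H, W)
```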

Cited by 1 publication (2 citation statements)
References 29 publications
“…Therefore, TVP could be regarded as a more controllable and plausible task. The concurrent work [24] proposes a similar setting and adopts a VQ-VAE based framework, which is totally different from our GAN-based inference network. Besides, this work [24] uses the first image and text to produce a motion anchor, thus guiding the generation for all subsequent frames.…”
Section: B. Image-to-Video Generation (mentioning)
confidence: 99%
“…The concurrent work [24] proposes a similar setting and adopts a VQ-VAE based framework, which is totally different from our GAN-based inference network. Besides, this work [24] uses the first image and text to produce a motion anchor, thus guiding the generation for all subsequent frames. In contrast, our framework fully explores the inference ability of text on motion information to generate step-wise embeddings specifically for each subsequent frame.…”
Section: B. Image-to-Video Generation (mentioning)
confidence: 99%
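The quoted statements contrast two conditioning schemes: the cited work [24] fuses the first image and the text once into a motion anchor that guides all subsequent frames, while the citing paper derives step-wise text embeddings per frame. A minimal sketch of the fixed-anchor scheme follows, under stated assumptions; make_motion_anchor, step_fn, and the fusion rule are hypothetical stand-ins, not code from either paper:

```python
# Sketch (assumption, not the papers' code) of the fixed motion-anchor scheme:
# the anchor is computed once from the first frame and the text, then reused
# unchanged while generating every later frame.
import torch

def make_motion_anchor(first_frame_feat: torch.Tensor,
                       text_feat: torch.Tensor) -> torch.Tensor:
    # Hypothetical fusion of image and text features into a single anchor.
    return torch.tanh(first_frame_feat + text_feat)

def generate_frames(first_frame_feat, text_feat, step_fn, num_frames=16):
    anchor = make_motion_anchor(first_frame_feat, text_feat)  # computed once
    frame = first_frame_feat
    frames = [frame]
    for _ in range(num_frames - 1):
        # Every step is guided by the same fixed anchor; the citing paper
        # instead recomputes a text-conditioned embedding for each frame.
        frame = step_fn(frame, anchor)
        frames.append(frame)
    return torch.stack(frames, dim=1)  # (B, T, D) for (B, D) feature inputs
```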