2022
DOI: 10.48550/arxiv.2205.09853
Preprint

MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Abstract: Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor, and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using…

Cited by 9 publications (11 citation statements) | References: 42 publications

“…All Layers ($M_{SSAL}$). Following [52], we explore the case of time-conditioning each layer as long as it is one-dimensional: e.g. in the frame model (Fig.…”
Section: Scale and Shift (mentioning; confidence: 99%)
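The scale-and-shift conditioning this excerpt refers to is, in essence, FiLM-style modulation: a one-dimensional timestep embedding is projected to a per-channel scale and shift that modulate normalized features inside each layer. Below is a minimal PyTorch sketch of that pattern; the class and parameter names are illustrative, not taken from the cited papers.

```python
# Minimal sketch of scale-and-shift (FiLM-style) time conditioning,
# as commonly used in diffusion U-Nets. All names are illustrative.
import torch
import torch.nn as nn

class ScaleShiftBlock(nn.Module):
    """Normalizes features, then modulates them with a per-channel
    scale and shift predicted from the 1-D timestep embedding."""
    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        # One linear layer maps the time embedding to 2 * channels
        # values: one scale and one shift per channel.
        self.to_scale_shift = nn.Linear(time_dim, 2 * channels)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; t_emb: (B, time_dim)
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=1)
        scale = scale[:, :, None, None]  # broadcast over spatial dims
        shift = shift[:, :, None, None]
        return self.norm(x) * (1 + scale) + shift

block = ScaleShiftBlock(channels=64, time_dim=128)
out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 128))  # (2, 64, 32, 32)
```

Because the conditioning input is one-dimensional, the same projection can be attached to any layer, which is what makes conditioning "all layers" cheap.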
“…The video prediction task refers to the problem of generating videos conditioned on past frames [18, 19], possibly with an additional natural language description [20-23] and/or motor commands [24-27]. Multiple classes of generative models have been utilized to tackle this problem, such as generative adversarial networks (GANs) [28-30], variational autoencoders (VAEs) [25-27, 31, 32], VQ-VAEs [33, 34], and diffusion models [35, 36]. Our work focuses on predicting future frames conditioned on past frames or motor commands and belongs to the family of two-stage methods that first encode the videos into a downsampled latent space and then use transformers to model an autoregressive prior [33, 34].…”
Section: Related Work (mentioning; confidence: 99%)
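The two-stage recipe this excerpt describes — encode frames to discrete latent tokens with a VQ model, then fit an autoregressive transformer prior over those tokens — can be sketched as follows. This is a hedged illustration assuming a PyTorch setup; the first-stage VQ encoder is treated as given, and all names are hypothetical.

```python
# Sketch of stage two of a two-stage video model: an autoregressive
# transformer prior over discrete latent codes. Illustrative only.
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) discrete codes from the first-stage VQ encoder.
        h = self.embed(tokens)
        # Causal mask so each position attends only to past tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.transformer(h, mask=mask)
        return self.head(h)  # next-token logits: (B, T, vocab_size)

prior = LatentPrior(vocab_size=1024)
codes = torch.randint(0, 1024, (2, 16))  # 16 latent tokens per clip
logits = prior(codes)                    # (2, 16, 1024)
```

Sampling tokens from these logits and decoding them with the frozen first-stage decoder yields the predicted frames.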
“…Almost all VFP models are autoregressive models based on ConvLSTMs or Transformers [35, 36]. Recently, a few promising non-autoregressive VFP models were proposed [37, 38, 39]. By combining ConvLSTMs with a neural ordinary differential equation (ODE) solver, Vid-ODE [16] is the first method that unifies VFP and VFI into a single model, and it is able to generate temporally continuous video.…”
Section: Related Work (mentioning; confidence: 99%)
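The "temporally continuous" property mentioned here comes from integrating a learned latent dynamics function, so frames can be decoded at arbitrary timestamps rather than at fixed steps. The sketch below illustrates that idea with a plain Euler integrator standing in for a proper ODE solver; it is not the Vid-ODE implementation, and all names are illustrative.

```python
# Sketch of a continuous-time latent state: dz/dt = f(z), integrated
# with Euler steps to any requested timestamps. Illustrative only.
import torch
import torch.nn as nn

class LatentODE(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, z0: torch.Tensor, t_grid: torch.Tensor) -> torch.Tensor:
        # Integrate from t_grid[0] through each requested time.
        zs, z = [z0], z0
        for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
            z = z + (t1 - t0) * self.f(z)  # one Euler step per interval
            zs.append(z)
        return torch.stack(zs)             # (len(t_grid), B, dim)

ode = LatentODE(dim=16)
z0 = torch.randn(8, 16)
# Irregular timestamps: the same model interpolates or extrapolates.
traj = ode(z0, torch.tensor([0.0, 0.4, 1.0, 1.5]))
```

Decoding each latent state in the trajectory to a frame is what lets one model serve both prediction (future times) and interpolation (intermediate times).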
“…Another work, masked conditional video diffusion (MCVD) [39] extends the 3D CNN-based diffusion models for video generation, but it is not an NP model and it is not able to do continuous video synthesis. In contrast, benefiting from the flexibility of NPs, our model is able to perform video random missing frames completion (VRC), which is more flexible than MCVD.…”
Section: Related Work (mentioning; confidence: 99%)
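The masking in MCVD's name refers, per the paper's abstract, to randomly dropping past and/or future conditioning frames during training, so that a single diffusion model covers prediction (condition on past only), unconditional generation (condition on nothing), and interpolation (condition on both ends). A minimal sketch of such a conditioning mask, with illustrative shapes and names rather than the authors' code:

```python
# Sketch of MCVD-style conditioning-frame masking. Each training example
# independently keeps or drops its past and future conditioning blocks.
import torch

def sample_condition_mask(batch: int, p_drop: float = 0.5):
    """Return Bernoulli keep-masks for past and future frame blocks.
    1 = keep the conditioning block, 0 = mask it out."""
    keep_past = (torch.rand(batch) > p_drop).float()
    keep_future = (torch.rand(batch) > p_drop).float()
    return keep_past, keep_future

past = torch.randn(4, 2, 3, 64, 64)    # (B, frames, C, H, W)
future = torch.randn(4, 2, 3, 64, 64)
kp, kf = sample_condition_mask(batch=4)
past = past * kp[:, None, None, None, None]     # zero out masked blocks
future = future * kf[:, None, None, None, None]
# The diffusion model is then trained to denoise the middle frames
# given whatever conditioning survives the mask.
```

Note the contrast the excerpt draws: this masking selects among fixed conditioning patterns, whereas a neural-process-style model can condition on an arbitrary subset of frames, enabling tasks like random missing-frame completion.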