2022
DOI: 10.48550/arxiv.2203.09494
Preprint

Transframer: Arbitrary Frame Prediction with Generative Models

Abstract: We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs sequences of sparse, compressed image features. Transframer is the state-of-the-art on a variety of video generation…
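As a rough illustration of what "sequences of sparse, compressed image features" can look like in practice, the sketch below encodes an image as a short list of (row, column, value) tokens taken from its largest-magnitude DCT coefficients. This is a minimal sketch under simplifying assumptions (a single global 2-D DCT and top-k selection); Transframer's actual DCT-based codec is more elaborate, and the function names here are hypothetical.

```python
import numpy as np
from scipy.fft import dctn, idctn

def to_sparse_dct_tokens(image: np.ndarray, k: int = 64):
    """Encode a grayscale image as k (row, col, value) tokens of its
    largest-magnitude DCT coefficients. Illustrative only; Transframer's
    actual sparse DCT representation is more involved."""
    coeffs = dctn(image, norm="ortho")
    flat = np.abs(coeffs).ravel()
    top = np.argpartition(flat, -k)[-k:]          # indices of the k largest coefficients
    rows, cols = np.unravel_index(top, coeffs.shape)
    values = coeffs[rows, cols]
    return list(zip(rows.tolist(), cols.tolist(), values.tolist()))

def from_sparse_dct_tokens(tokens, shape):
    """Invert: scatter the tokens into a coefficient grid, then inverse DCT."""
    coeffs = np.zeros(shape)
    for r, c, v in tokens:
        coeffs[r, c] = v
    return idctn(coeffs, norm="ortho")

image = np.random.rand(32, 32)                    # stand-in for a real frame
tokens = to_sparse_dct_tokens(image, k=64)        # short, sparse token sequence
recon = from_sparse_dct_tokens(tokens, image.shape)
```

A sequence model can then predict such tokens autoregressively, conditioned on the tokens of the annotated context frames, which is the role the abstract assigns to the Transformer component.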

Cited by 7 publications (11 citation statements) · References 25 publications
“…However, it is limited to the scenario when an output of a vision task can be manually represented as a short discrete sequence, which is rarely true for vision tasks. In [33] the authors propose a Transframer model, which uses a language model for modeling image outputs represented as sparse discrete cosine transform codes. However, the paper only shows qualitative results for "discriminative" tasks.…”
Section: Related Work
confidence: 99%
“…Therefore, it still underperforms RNN-based baselines in the video domain. Transformers for sequential modeling: Inspired by the success of autoregressive Transformers in language modeling (Radford et al., 2018; Brown et al., 2020), they were adapted to video generation tasks (Yan et al., 2021; Ren & Wang, 2022; Micheli et al., 2022; Nash et al., 2022). To handle the high dimensionality of images, these methods often adopt a two-stage training strategy by first mapping images to discrete tokens (Esser et al., 2021), and then learning a Transformer over tokens.…”
Section: Related Work
confidence: 99%
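The two-stage recipe described in this excerpt, first mapping images to discrete tokens and then learning a Transformer over tokens, reduces generation to next-token sampling. Below is a minimal sketch of the second stage; `model` is a hypothetical stand-in for a trained causal Transformer, and the dummy uniform model is for illustration only.

```python
import numpy as np

def sample_video_tokens(model, context_tokens, n_new, vocab_size, rng):
    """Stage 2 of the two-stage recipe: given tokens produced by a frozen
    image tokenizer (stage 1), autoregressively sample new tokens.
    `model` maps a token prefix to next-token logits; it stands in for a
    causal Transformer and is an assumption of this sketch."""
    tokens = list(context_tokens)
    for _ in range(n_new):
        logits = model(np.array(tokens))           # (vocab_size,) next-token logits
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                       # softmax over the vocabulary
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens

# Usage with a dummy uniform "model" (hypothetical stand-in):
rng = np.random.default_rng(0)
uniform_model = lambda toks: np.zeros(512)
out = sample_video_tokens(uniform_model, [1, 2, 3], n_new=16, vocab_size=512, rng=rng)
```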
“…With the prevalence of Transformers in the NLP field (Vaswani et al., 2017; Kenton & Toutanova, 2019), there have been tremendous efforts in introducing it to computer vision tasks (Carion et al., 2020; Liu et al., 2021). Our method is highly motivated by previous works in Transformer-based autoregressive image and video generation (Esser et al., 2021; Chen et al., 2020a; Yan et al., 2021; Nash et al., 2022; Ren & Wang, 2022). VQ-GAN (Esser et al., 2021) first pretrains the encoder, decoder and a codebook that can map images to discrete tokens and tokens back to images.…”
Section: A. Additional Related Work
confidence: 99%
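The codebook lookup at the heart of VQ-GAN's tokenization can be sketched in a few lines of numpy: encoder features are snapped to their nearest codebook entries (giving discrete tokens), and decoding is a table lookup. This is an illustrative sketch only; the real model learns the codebook jointly with a CNN encoder/decoder and adversarial losses, and all shapes here are made up.

```python
import numpy as np

def vq_encode(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.
    features: (N, D) encoder outputs; codebook: (K, D) learned embeddings."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    return d2.argmin(axis=1)                       # one discrete token per feature

def vq_decode(indices, codebook):
    """Tokens back to continuous features by table lookup."""
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))              # K=512 codes, D=16 dims (made up)
features = rng.normal(size=(64, 16))               # e.g. an 8x8 grid of encoder features
tokens = vq_encode(features, codebook)             # (64,) integer tokens
recon = vq_decode(tokens, codebook)                # (64, 16) quantized features
```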
“…Operating on a compressed space: Directly using compressed representations for downstream tasks for video or image data has primarily been studied by considering standard image and video codecs such as JPEG or MPEG [16,25,65], DCT [40,67] or scattering transforms [43]. However, in general these approaches require devising novel architectures, data pipelines, or training strategies in order to handle these representations.…”
Section: Related Work
confidence: 99%
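For a concrete sense of the codec-style inputs this excerpt refers to, the sketch below computes JPEG-style 8x8 blockwise DCT coefficients that could be fed to a model in place of raw pixels. It assumes a grayscale image with sides divisible by 8 and omits the quantization and entropy-coding stages of a real codec.

```python
import numpy as np
from scipy.fft import dctn

def blockwise_dct(image: np.ndarray, block: int = 8) -> np.ndarray:
    """Return JPEG-style blockwise DCT coefficients for a grayscale image
    whose sides are multiples of `block`. Output: (H//block, W//block, block, block)."""
    h, w = image.shape
    blocks = image.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    return dctn(blocks, axes=(-2, -1), norm="ortho")  # 2-D DCT applied per block

features = blockwise_dct(np.random.rand(32, 32))   # (4, 4, 8, 8) coefficient grid
```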