Transframer: Arbitrary Frame Prediction with Generative Models

Nash, Charlie; Carreira, João; Walker, Jacob; Barr, Iain; Jaegle, Andrew; Malinowski, Mateusz; Battaglia, Peter W.

doi:10.48550/arxiv.2203.09494

Cited by 7 publications

(11 citation statements)

References 25 publications


“…However, it is limited to the scenario when an output of a vision task can be manually represented as a short discrete sequence, which is rarely true for vision tasks. In [33] the authors propose a Transframer model, which uses a language model for modeling image outputs represented as sparse discrete cosine transform codes. However, the paper only shows qualitative results for "discriminative" tasks.…”

Section: Related Work (mentioning)

confidence: 99%

UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

Kolesnikov, Pinto, Beyer et al. 2022

Preprint

We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks; it requires no task-specific modifications which require extensive human expertise. The approach involves two components: (I) a base model (feedforward) which is trained to directly predict raw vision outputs, guided by a learned discrete code and (II) a language model (autoregressive) that is trained to generate the guiding code. These components complement each other: the language model is well-suited to modeling structured interdependent data, while the base model is efficient at dealing with high-dimensional outputs. We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks: panoptic segmentation, depth prediction and image colorization, where we achieve competitive and near state-of-the-art results. Our experimental results suggest that UViM is a promising candidate for a unified modeling approach in computer vision.


“…Therefore, it still underperforms RNN-based baselines in the video Transformers for sequential modeling. Inspired by the success of autoregressive Transformers in language modeling (Radford et al, 2018; Brown et al, 2020), they were adapted to video generation tasks (Yan et al, 2021; Ren & Wang, 2022; Micheli et al, 2022; Nash et al, 2022). To handle the high dimensionality of images, these methods often adopt a two-stage training strategy by first mapping images to discrete tokens (Esser et al, 2021), and then learning a Transformer over tokens.…”

Section: Related Work (mentioning)

confidence: 99%

“…With the prevalence of Transformers in the NLP field (Vaswani et al, 2017; Kenton & Toutanova, 2019), there have been tremendous efforts in introducing it to computer vision tasks (Carion et al, 2020; Liu et al, 2021). Our method is highly motivated by previous works in Transformer-based autoregressive image and video generation (Esser et al, 2021; Chen et al, 2020a; Yan et al, 2021; Nash et al, 2022; Ren & Wang, 2022). VQ-GAN (Esser et al, 2021) first pretrains the encoder, decoder and a codebook that can map images to discrete tokens and tokens back to images.…”

Section: A Additional Related Work (mentioning)

confidence: 99%

“…Then, a GPT-like Transformer model is trained to autoregressively predict the input tokens for high-fidelity image generation. Transframer (Nash et al, 2022) instead discretizes video frames using Discrete Cosine Transform (DCT), and learns an autoregressive Transformer over these sparse representations from multiple frames. The design of SlotFormer is mostly related to (Ren & Wang, 2022), which also uses image tokens from multiple frames to enable consistent long-term view synthesis.…”

Section: A Additional Related Work (mentioning)

confidence: 99%


SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Wu, Dvornik, Greff et al. 2022

Preprint

Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer, a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks. Additional results and details are available at our Website.


“…Operating on a compressed space. Directly using compressed representations for downstream tasks for video or image data has primarily been studied by considering standard image and video codecs such as JPEG or MPEG [16,25,65], DCT [40,67] or scattering transforms [43]. However, in general these approaches require devising novel architectures, data pipelines, or training strategies in order to handle these representations.…”

Section: Related Work (mentioning)

confidence: 99%

Compressed Vision for Efficient Video Understanding

Wiles, Carreira, Barr et al. 2022

Preprint

Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels (data transfer, speed and memory), making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using a compressed representation.
