FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

Li, Rui; Deng, Hanming; Huang, Yangyi; Shi, Xiaoyu; Lu, Lewei; Sun, Wenxiu; Wang, Xiaogang; Dai, Jifeng; Li, Hongsheng

doi:10.1109/iccv48922.2021.01378

Cited by 116 publications (108 citation statements)

References 21 publications

Citation statement types: Supporting, Mentioning (108), Contrasting

“…Following ViT, many transformer-based architectures such as PCT [27], IPT [79], T2T-ViT [44], DeepViT [167], SETR [81], PVT [45], CaiT [168], TNT [82], Swin-transformer [46], Query2Label [83], MoCoV3 [84], BEiT [85], SegFormer [86], FuseFormer [169], and MAE [170] have appeared, with excellent results for many kinds of visual tasks including image classification, object detection, semantic segmentation, point cloud processing, action recognition, and self-supervised learning.…”

Section: Vision Transformers (mentioning)

confidence: 99%

Attention mechanisms in computer vision: A survey

et al. 2022


Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multimodal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention, and branch attention; a related repository https://github.com/MenghaoGuo/Awesome-Vision-Attentions is dedicated to collecting related work. We also suggest future directions for attention mechanism research.

“…All these works have consistently found promising generalization capabilities of Transformer architectures. Nevertheless, VTs are still a mystery with regards to this, and are limited to few works which have tested their model on OOD data [13], [62], [68], [69], [115], [126] or evaluated the learned features in other settings [50], [52], [71], [88]. While we expect them to follow the same trend as other modalities, further research is needed.…”

Section: The Road Ahead (mentioning)

confidence: 99%

“…Minimal embeddings. Inspired by the success of ViT [7], few video methods omit the use of large backbones and perform linear projections or convolutions instead, in order to embed tokens representing smaller portions of the input video [7], [9], [11], [88], [115], [130]. Empirical studies like [9], [130], show that stand-alone Transformers (i.e., without complex CNN backbones) are as performant as CNN counterparts, even at the expense of high computational and data resources.…”

Section: Embedding (mentioning)

confidence: 99%

“…Finally, [10], [174] report having substantial computational resources available, which allowed them to fit in memory both, a large backbone and a big Transformer. For some designs it is more natural to train end-to-end as they embed Transformer layers within the backbone, either to enhance local convolutional features across multiple ResNet streams [132], or to enhance feature representations between a CNN encoder and decoder [114], [115]. Pre-trained Transformers.…”

Section: Training Regime (mentioning)

confidence: 99%

“…Aside from very few exceptions, we see VTs applied to high level tasks only. Low-level tasks for video, such as frame generation [60], [61] or inpainting [114], [115], are much more challenging given the difficulty and computational complexity of generating low-level detail in a highly dimensional space. Still, we hypothesize that, when properly handled, the long-range modeling capabilities of Transformers could greatly benefit these tasks.…”

Section: The Road Ahead (mentioning)

confidence: 99%


Video Transformers: A Survey

Selva, Johansen, Escalera et al. 2022

Preprint


Transformer models have shown great success modeling long-range interactions. Nevertheless, they scale quadratically with input length and lack inductive biases. These limitations can be further exacerbated when dealing with the high dimensionality of video. Proper modeling of video, which can span from seconds to hours, requires handling long-range interactions. This makes Transformers a promising tool for solving video-related tasks, but some adaptations are required. While there are previous works that study the advances of Transformers for vision tasks, there is none that focuses on in-depth analysis of video-specific designs. In this survey we analyse and summarize the main contributions and trends for adapting Transformers to model video data. Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones to reduce dimensionality and a predominance of patches and frames as tokens. Furthermore, we study how the Transformer layer has been tweaked to handle longer sequences, generally by reducing the number of tokens in a single attention operation. Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches. Finally, we explore how other modalities are integrated with video and conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D CNN counterparts with equivalent FLOPs and no significant parameter increase.


Error Compensation Framework for Flow-Guided Video Inpainting

Kang, Kim 2022

Lecture Notes in Computer Science

