2020
DOI: 10.48550/arxiv.2011.14503
Preprint

End-to-End Video Instance Segmentation with Transformers

Abstract: Without bells and whistles, VisTR achieves the highest speed among all existing VIS models, and achieves the best result among methods using a single model on the YouTube-VIS dataset. For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy. We hope that VisTR can motivate future research for more video understanding tasks.
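The idea the abstract summarizes, treating a whole clip as one token sequence and decoding all per-frame instance predictions in parallel with a Transformer, can be illustrated with a short sketch. The code below is a minimal, hedged illustration, not the authors' implementation: the `VisTRSketch` name, the single patchify convolution standing in for the ResNet backbone, the layer sizes, and the omission of positional encodings and the mask head are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class VisTRSketch(nn.Module):
    def __init__(self, d_model=256, num_frames=6, num_queries=10, num_classes=40):
        super().__init__()
        # Stand-in backbone: one patchify convolution instead of a ResNet.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Positional encodings omitted for brevity.
        self.transformer = nn.Transformer(d_model, nhead=8, batch_first=True)
        # One learned query per (instance, frame) slot; slots belonging to the
        # same instance share an output position across frames, so tracking
        # falls out of the parallel decoding rather than needing a tracker.
        self.queries = nn.Embedding(num_queries * num_frames, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1: "no object"
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, clip):  # clip: (T, 3, H, W), the whole video clip at once
        feats = self.backbone(clip)                    # (T, C, H/16, W/16)
        t, c, h, w = feats.shape
        # Flatten space *and* time into a single token sequence.
        src = feats.flatten(2).permute(0, 2, 1).reshape(1, t * h * w, c)
        tgt = self.queries.weight.unsqueeze(0)         # (1, N*T, C)
        hs = self.transformer(src, tgt)                # decode all slots in parallel
        return self.class_head(hs), self.box_head(hs).sigmoid()

model = VisTRSketch()
logits, boxes = model(torch.randn(6, 3, 128, 128))     # a 6-frame toy clip
print(logits.shape, boxes.shape)  # (1, 60, 41), (1, 60, 4)
```

The single forward pass over the whole clip is what makes the framework "end-to-end": there is no per-frame detection followed by association.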

Cited by 56 publications (63 citation statements)
References 27 publications

“…Recently, Vision Transformers [10], [11], [40], [41], [47], [54] have made great progress in computer vision. The work can be mainly divided into two directions: replacing the CNN backbone with a Transformer-like architecture [40], [47], [55], and using object queries to represent instances for scene understanding [10], [11], [39]. Our work is related to the second direction.…”
Section: Related Work (mentioning)
confidence: 99%
“…We suppose that this approach cannot fully exploit the temporal context of a video clip. VisTR [39] views the VIS task as a direct end-to-end parallel sequence prediction problem. The targets of a clip are scattered across such an instance sequence, so directly performing target assignment is not optimal.…”
Section: Related Work (mentioning)
confidence: 99%
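The target-assignment point this quote raises is concrete: because VisTR supervises whole instance sequences, the matching between predictions and ground truth must also happen at the sequence level rather than per frame. Below is a minimal, hypothetical sketch of such clip-level bipartite matching via Hungarian assignment; the cost terms and the 5.0 box weight are illustrative guesses, not VisTR's published configuration.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_instance_sequences(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (N, T, K+1), pred_boxes: (N, T, 4),
    gt_labels: (M,), gt_boxes: (M, T, 4). Returns matched (row, col) indices."""
    prob = pred_logits.softmax(-1)                   # (N, T, K+1)
    # Classification cost: negative probability of the GT class,
    # averaged over all frames of the clip.
    cost_cls = -prob[..., gt_labels].mean(dim=1)     # (N, M)
    # Box cost: mean L1 distance between whole trajectories, so one
    # prediction is matched to one ground-truth instance for the clip.
    cost_box = torch.cdist(
        pred_boxes.flatten(1), gt_boxes.flatten(1), p=1
    ) / pred_boxes.shape[1]                          # (N, M)
    cost = cost_cls + 5.0 * cost_box                 # weight is an assumption
    row, col = linear_sum_assignment(cost.detach().numpy())
    return row, col

# Toy example: 10 predicted sequences, 2 GT trajectories, 6-frame clip.
row, col = match_instance_sequences(
    torch.randn(10, 6, 41), torch.rand(10, 6, 4),
    torch.tensor([3, 7]), torch.rand(2, 6, 4))
print(list(zip(row, col)))  # e.g. [(2, 0), (5, 1)]
```

Matching trajectories as wholes is what keeps the assignment temporally consistent; per-frame matching could assign the same query slot to different objects in different frames.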
“…Recently, the Vision Transformer (ViT) architecture (Dosovitskiy et al, 2020) first introduced transformers to the image classification task and suggested that a convolution-free architecture can achieve state-of-the-art performance. Later on, transformers have been widely used in several vision tasks, such as detection (Carion et al, 2020; Zhu et al, 2020), segmentation (Zheng et al, 2020; Wang et al, 2020), and pose estimation (Lin et al, 2020; Li et al, 2021). However, these performance improvements require increasing model size and computational complexity, making it difficult to deploy such huge models in real-world applications like augmented reality and autonomous driving.…”
Section: Introduction (mentioning)
confidence: 99%