2021
DOI: 10.48550/arxiv.2106.00588
Preprint

TransVOS: Video Object Segmentation with Transformers

Jianbiao Mei,
Mengmeng Wang,
Yeneng Lin
et al.

Abstract: Recently, Space-Time Memory Network (STM) based methods have achieved state-of-the-art performance in semi-supervised video object segmentation (VOS). A critical problem in this task is how to model the dependency both among different frames and inside every frame. However, most of these methods neglect the spatial relationships (inside each frame) and do not make full use of the temporal relationships (among different frames). In this paper, we propose a new transformer-based framework, termed TransVOS, introdu…

Cited by 4 publications (8 citation statements)
References 35 publications
“…Recently, with the observance of its strength in parallel modeling global correlation or attention, transformer blocks were introduced to computer vision tasks, such as image recognition [10], saliency prediction [53], object detection [54,4], and object segmentation [41], where vision transformers have achieved excellent performance compared to the CNN-based counterparts. Researchers then employed transformer architecture into the VOS task [11,23,25,50]. SST [11] adopts the transformer's encoder to compute attention based on the spatial-temporal information among multiple history frames.…”
Section: Related Work
confidence: 99%
“…In [23], a transductive branch is used to capture the spatial-temporal information, which is integrated with an online inductive branch within a unified framework. TransVOS [25] introduces a transformer-based VOS framework with an intuitive structure drawn from the transformer networks in NLP. AOT [50] proposes an Identification Embedding to construct multi-object matching and computes attention for multiple objects simultaneously.…”
Section: Related Work
confidence: 99%
“…State-of-the-art STM-cycle [13] adds a cyclic loss to reduce error propagation. Concurrently, some works [7,18] instead use attention via transformers to address pixel spatiotemporal relations and model scalability. While attention works capture higher-level features of foreground objects, their segmentations often lack detail or fail to propagate fine deformations and movements across frames, even when the object does not change much.…”
Section: Related Work
confidence: 99%