TransVOS: Video Object Segmentation with Transformers

Mei, Jianbiao; Wang, Mengmeng; Lin, Yeneng; Yuan, Yi; Liu, Yong

doi:10.48550/arxiv.2106.00588

Cited by 4 publications

(8 citation statements)

References 35 publications

“…Recently, with the observance of its strength in parallel modeling global correlation or attention, transformer blocks were introduced to computer vision tasks, such as image recognition [10], saliency prediction [53], object detection [54,4], and object segmentation [41], where vision transformers have achieved excellent performance compared to the CNN-based counterparts. Researchers then employed transformer architecture into the VOS task [11,23,25,50]. SST [11] adopts the transformer's encoder to compute attention based on the spatial-temporal information among multiple history frames.…”

Section: Related Work (mentioning)

confidence: 99%

“…In [23], a transductive branch is used to capture the spatial-temporal information, which is integrated with an online inductive branch within a unified framework. TransVOS [25] introduces a transformer-based VOS framework with intuitive structure from the transformer networks in NLP. AOT [50] proposes an Identification Embedding to construct multi-object matching and computes attention for multiple objects simultaneously.…”

Section: Related Work (mentioning)

confidence: 99%

“…Then we apply cross-attention and bilateral attention (described below) to it with the reference frame features and add the results. Following the common practice in vision transformers [50,25], we insert layer normalization [1] before and after each attention module. Finally, we employ a two-layer feed-forward MLP block before feeding the output to the next layer.…”

Section: Bilateral Transformer and Bilateral Attention (mentioning)

confidence: 99%

“…Due to the absence of class-specific features, VOS models need to match features of the reference frame to that of the query frames both spatially and temporally to capture the class-agnostic correspondence and propagate the segmentation masks. Previous methods attempt to store features from preceding frames in memory networks and match the query frame through a non-local attention mechanism [27,7], or compute a global-to-global attention through an encoder-decoder transformer [25], or propagate and calibrate features from the reference frame to the query frames using a propagation-correction scheme [47]. These methods employ a global attention mechanism to establish correspondence between the full reference frame and the full query frame.…”

Section: Introduction (mentioning)

confidence: 99%
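
The global attention matching described in the statement above can be illustrated with a minimal sketch. It is not taken from TransVOS or any of the cited papers: the function names, the toy two-dimensional features, and the use of scaled dot-product similarity are all assumptions chosen for illustration. Every query-frame pixel attends to every reference-frame pixel, and the reference mask is propagated as an attention-weighted average.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def propagate_mask(query_feats, ref_feats, ref_mask):
    """Hypothetical helper: global attention from query pixels to reference pixels.

    query_feats: one feature vector per query-frame pixel
    ref_feats:   one feature vector per reference-frame pixel
    ref_mask:    foreground probability per reference-frame pixel
    Returns a soft foreground probability per query-frame pixel.
    """
    dim = len(ref_feats[0])
    soft_mask = []
    for q in query_feats:
        # Scaled dot-product similarity against the full reference frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in ref_feats]
        weights = softmax(scores)
        # Attention-weighted average of the reference mask values.
        soft_mask.append(sum(w * m for w, m in zip(weights, ref_mask)))
    return soft_mask

# Toy data (assumed): three reference pixels, two query pixels.
ref_feats = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
ref_mask = [1.0, 0.0, 1.0]
query_feats = [[4.0, 0.0], [0.0, 4.0]]

soft_mask = propagate_mask(query_feats, ref_feats, ref_mask)
print(soft_mask)  # first value near 1 (foreground), second near 0 (background)
```

Because each query pixel is compared with the whole reference frame, the cost grows with the product of the two pixel counts, which is the property the citing paper contrasts with its neighborhood-restricted attention.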

BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation

Ye, Yuan, Mittal et al. 2022

Preprint

Video Object Segmentation (VOS) is fundamental to video understanding. Transformer-based methods show significant performance improvement on semi-supervised VOS. However, existing work faces challenges segmenting visually similar objects in close proximity to each other. In this paper, we propose a novel Bilateral Attention Transformer in Motion-Appearance Neighboring space (BATMAN) for semi-supervised VOS. It captures object motion in the video via a novel optical flow calibration module that fuses the segmentation mask with optical flow estimation to improve within-object optical flow smoothness and reduce noise at object boundaries. This calibrated optical flow is then employed in our novel bilateral attention, which computes the correspondence between the query and reference frames in the neighboring bilateral space considering both motion and appearance. Extensive experiments validate the effectiveness of the BATMAN architecture by outperforming all existing state-of-the-art methods on all four popular VOS benchmarks: Youtube-VOS 2019 (85.0%), Youtube-VOS 2018 (85.3%), DAVIS 2017 Val/Test-dev (86.2%/82.2%), and DAVIS 2016 (92.5%).

“…State-of-the-art STM-cycle [13] adds a cyclic loss to reduce error propagation. Concurrently, some works [7,18] instead use attention via transformers to address pixel spatiotemporal relations and model scalability. While attention works capture higher-level features of foreground objects, their segmentations often lack detail or fail to propagate fine deformations and movements across frames, even when the object does not change much.…”

Section: Related Work (mentioning)

confidence: 99%

FlowVOS: Weakly-Supervised Visual Warping for Detail-Preserving and Temporally Consistent Single-Shot Video Object Segmentation

Gong, Holsinger, Yeung 2021

Preprint

We consider the task of semi-supervised video object segmentation (VOS). Our approach mitigates shortcomings in previous VOS work by addressing detail preservation and temporal consistency using visual warping. In contrast to prior work that uses full optical flow, we introduce a new foreground-targeted visual warping approach that learns flow fields from VOS data. We train a flow module to capture detailed motion between frames using two weakly-supervised losses. Our object-focused approach of warping previous foreground object masks to their positions in the target frame enables detailed mask refinement with fast runtimes without using extra flow supervision. It can also be integrated directly into state-of-the-art segmentation networks. On the DAVIS17 and YouTubeVOS benchmarks, we outperform state-of-the-art offline methods that do not use extra data, as well as many online methods that use extra data. Qualitatively, we also show our approach produces segmentations with high detail and temporal consistency.
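
The idea of warping a previous foreground mask to its position in the target frame can be sketched as follows. This is only an illustration under stated assumptions, not the FlowVOS method: the flow field here is hand-written rather than learned, it is assumed to be a backward flow (each target pixel stores the offset to its source pixel in the previous frame), and sampling is nearest-neighbor. The name `warp_mask` is hypothetical.

```python
def warp_mask(prev_mask, backward_flow):
    """Hypothetical helper: warp a binary mask into the target frame.

    prev_mask:     2D list of 0/1 values for the previous frame
    backward_flow: 2D list of (dx, dy) offsets; target pixel (x, y)
                   is sampled from (x + dx, y + dy) in the previous frame
    Pixels whose source falls outside the frame are set to background.
    """
    height, width = len(prev_mask), len(prev_mask[0])
    warped = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = backward_flow[y][x]
            # Nearest-neighbor sampling of the source location.
            src_x, src_y = x + round(dx), y + round(dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = prev_mask[src_y][src_x]
    return warped

# Toy data (assumed): a 2-pixel-tall object in column 1.
prev_mask = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
# Every target pixel came from one pixel to its left,
# so the object moves one pixel to the right.
flow = [[(-1.0, 0.0)] * 4 for _ in range(3)]

warped = warp_mask(prev_mask, flow)
for row in warped:
    print(row)
# [0, 0, 1, 0]
# [0, 0, 1, 0]
# [0, 0, 0, 0]
```

Backward sampling is used here because it assigns exactly one value to every target pixel, which avoids the holes that forward splatting can leave.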

Two-Stream Networks for Object Segmentation in Videos

Lu, Tian, Yang et al. 2022

Preprint

Existing matching-based approaches perform video object segmentation (VOS) via retrieving support features from a pixel-level memory, while some pixels may suffer from lack of correspondence in the memory (i.e., unseen), which inevitably limits their segmentation performance. In this paper, we present a Two-Stream Network (TSN). Our TSN includes (i) a pixel stream with a conventional pixel-level memory, to segment the seen pixels based on their pixel-level memory retrieval; (ii) an instance stream for the unseen pixels, where a holistic understanding of the instance is obtained with dynamic segmentation heads conditioned on the features of the target instance; and (iii) a pixel division module generating a routing map, with which output embeddings of the two streams are fused together. The compact instance stream effectively improves the segmentation accuracy of the unseen pixels, while fusing two streams with the adaptive routing map leads to an overall performance boost. Through extensive experiments, we demonstrate the effectiveness of our proposed TSN, and we also report state-of-the-art per-
