Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement

Liang, Yongqing; Li, Xin; Jafari, Navid H.; Chen, Qin

doi:10.48550/arxiv.2010.07958

Cited by 3 publications

(10 citation statements)

References 33 publications

(50 reference statements)

“…Matching-based methods. Recently, state-of-the-art performance has been achieved by matching-based methods [9,32,43,25,22,18,29], which perform feature matching to learn target object appearances offline. FEELVOS [32] and CFBI [43] perform the nearest neighbor matching between the current frame and the first and previous frames in the feature space.…”

Section: Related Work (mentioning)

confidence: 99%
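
The nearest neighbor matching mentioned in this statement can be illustrated with a minimal sketch. This is not the FEELVOS or CFBI implementation: the function name `nearest_neighbor_distances`, the use of plain Euclidean distance, and the toy 2-D feature vectors are all assumptions made for illustration. For each feature vector of the current frame, it reports the distance to the closest feature vector taken from a reference frame (for example the first or the previous frame).

```python
import math


def nearest_neighbor_distances(query_feats, ref_feats):
    """Distance from each query feature to its nearest reference feature.

    query_feats: feature vectors of current-frame pixels (hypothetical data).
    ref_feats:   feature vectors of object pixels in a reference frame.
    A small distance suggests the query pixel resembles the target object.
    """
    return [min(math.dist(q, r) for r in ref_feats) for q in query_feats]


# Toy example: two reference object features, two current-frame pixels.
ref_object_feats = [(1.0, 0.0), (0.9, 0.1)]
current_frame_feats = [(1.0, 0.0), (0.0, 1.0)]
distances = nearest_neighbor_distances(current_frame_feats, ref_object_feats)
print(distances)  # first pixel matches exactly, second is far from the object
```

In a real model the features would come from a learned encoder and the distances would feed a segmentation head; both parts are omitted here.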

“…Even though, TransVOS can still learn long-term dependency. Just like most STM-based methods [25,16,22,18,29], we synthesize video clips by applying data augmentations (random affine, color, flip, resize and crop) on a static image of datasets [4,20,15,7]. Then we use the synthetic videos to pretrain our model.…”

Section: Training and Inference (mentioning)

confidence: 99%

“…The latter is the relationships among pixels in one specific frame, including object appearance information for target localization and segmentation, which is important for learning local target object structure and helps obtain accurate masks. Recently, a group of matching-based methods [9,32,43,25,16,22,13,18,29] provide partial solutions for capturing the above correspondence and achieve state-of-the-art performance. The basic idea of these methods is to compute the similarities of target objects between the current and past frames by feature matching, in which the attention mechanism is widely used.…”

Section: Introduction (mentioning)

confidence: 99%

“…The basic idea of these methods is to compute the similarities of target objects between the current and past frames by feature matching, in which the attention mechanism is widely used. Among them, the Space-Time Memory (STM) based approaches [25,16,22,13,18,29] have achieved great success. They propose to apply spatio-temporal attention between every pixel in previous frames and every pixel in the current frame.…”

Section: Introduction (mentioning)

confidence: 99%
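
The spatio-temporal attention described in this statement can be sketched as a generic attention read over a memory of past-frame pixels. This is a simplified illustration, not the STM code: the names `softmax` and `memory_read`, the dot-product similarity, and the toy key and value vectors are assumptions. Each query pixel of the current frame is compared against every memory pixel, and the resulting weights are used to average the memory values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query_keys, memory_keys, memory_values):
    """Attend from every query pixel to every memory pixel.

    query_keys:    key vectors of current-frame pixels.
    memory_keys:   key vectors of pixels from all stored past frames.
    memory_values: value vectors aligned with memory_keys.
    Returns one attention-weighted value vector per query pixel.
    """
    dim = len(memory_values[0])
    outputs = []
    for q in query_keys:
        # Dot-product similarity between this query and each memory key.
        scores = [sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, memory_values)) for d in range(dim)]
        )
    return outputs


# Toy example: one query pixel that strongly matches the first memory pixel.
read = memory_read(
    query_keys=[(1.0, 0.0)],
    memory_keys=[(10.0, 0.0), (0.0, 10.0)],
    memory_values=[(1.0, 0.0), (0.0, 1.0)],
)
print(read)  # dominated by the first memory value
```

Because every query pixel attends to every stored pixel, the cost of this read grows with the number of frames kept in memory.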

TransVOS: Video Object Segmentation with Transformers

Mei, Wang, Lin et al. 2021 (preprint)

Recently, Space-Time Memory Network (STM) based methods have achieved state-of-the-art performance in semi-supervised video object segmentation (VOS). A critical problem in this task is how to model the dependency both among different frames and inside every frame. However, most of these methods neglect the spatial relationships (inside each frame) and do not make full use of the temporal relationships (among different frames). In this paper, we propose a new transformer-based framework, termed TransVOS, introducing a vision transformer to fully exploit and model both the temporal and spatial relationships. Moreover, most STM-based approaches employ two disparate encoders to extract features of two significant inputs, i.e., reference sets (history frames with predicted masks) and query frame, respectively, increasing the models' parameters and complexity. To slim the popular two-encoder pipeline while keeping the effectiveness, we design a single two-path feature extractor to encode the above two inputs in a unified way. Extensive experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. Codes will be released when it is published.

“…Semi-supervised video object segmentation. Following the taxonomy proposed by [19], recent VOS methods can be categorized into implicit and explicit according to the approach followed to address the problem.…”

Section: Introduction (mentioning)

confidence: 99%

Adaptive Memory Management for Video Object Segmentation

Ali, Poullis 2022 (preprint)

Matching-based networks have achieved state-of-the-art performance for video object segmentation (VOS) tasks by storing every-k frames in an external memory bank for future inference. Storing the intermediate frames' predictions provides the network with richer cues for segmenting an object in the current frame. However, the size of the memory bank gradually increases with the length of the video, which slows down inference speed and makes it impractical to handle arbitrary-length videos. This paper proposes an adaptive memory bank strategy for matching-based networks for semi-supervised video object segmentation (VOS) that can handle videos of arbitrary length by discarding obsolete features. Features are indexed based on their importance in the segmentation of the objects in previous frames. Based on the index, we discard unimportant features to accommodate new features. We present our experiments on DAVIS 2016, DAVIS 2017, and YouTube-VOS that demonstrate that our method outperforms state-of-the-art methods that employ the first-and-latest strategy with fixed-sized memory banks and achieves comparable performance to the every-k strategy with increasing-sized memory banks. Furthermore, experiments show that our method increases inference speed by up to 80% over the every-k and 35% over first-and-latest strategies.
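
The idea of discarding unimportant features to make room for new ones can be sketched as a fixed-capacity bank with importance-based eviction. This is only an illustration under stated assumptions, not the method of the abstract: the class `BoundedFeatureBank`, its `reinforce` and `add` methods, and the additive importance score are hypothetical, and the abstract does not specify how importance is computed.

```python
class BoundedFeatureBank:
    """Fixed-capacity feature store that evicts the least important entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = {}  # feature_id -> [feature, importance]

    def reinforce(self, feature_id, amount=1.0):
        """Raise the importance of a feature that helped segment a frame.

        The additive update is an assumption made for this sketch.
        """
        self.entries[feature_id][1] += amount

    def add(self, feature_id, feature, importance=0.0):
        """Insert a feature; return the id of the evicted entry, if any."""
        evicted = None
        if feature_id not in self.entries and len(self.entries) >= self.capacity:
            # Drop the entry with the lowest importance score.
            evicted = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[evicted]
        self.entries[feature_id] = [feature, importance]
        return evicted


# Toy example: a bank of two slots receives a third feature.
bank = BoundedFeatureBank(capacity=2)
bank.add("f0", (0.1, 0.2))
bank.add("f1", (0.3, 0.4))
bank.reinforce("f0", 2.0)  # "f0" was useful in an earlier frame
evicted = bank.add("f2", (0.5, 0.6))
print(evicted, sorted(bank.entries))
```

Keeping the bank at a fixed size bounds the cost of each matching step regardless of video length, which is the motivation the abstract gives for discarding obsolete features.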

Tackling Background Distraction in Video Object Segmentation

Cho, Heansung, Minhyeok et al. 2022 (preprint)

Semi-supervised video object segmentation (VOS) aims to densely track certain designated objects in videos. One of the main challenges in this task is the existence of background distractors that appear similar to the target objects. We propose three novel strategies to suppress such distractors: 1) a spatio-temporally diversified template construction scheme to obtain generalized properties of the target objects; 2) a learnable distance-scoring function to exclude spatially-distant distractors by exploiting the temporal consistency between two consecutive frames; 3) swap-and-attach augmentation to force each object to have unique features by providing training samples containing entangled objects. On all public benchmark datasets, our model achieves a comparable performance to contemporary state-of-the-art approaches, even with real-time performance. Qualitative results also demonstrate the superiority of our approach over existing methods. We believe our approach will be widely used for future VOS research. Code and models are available at https://github.com/suhwan-cho/TBD.
