2022
DOI: 10.48550/arxiv.2207.06953
Preprint

Tackling Background Distraction in Video Object Segmentation

Abstract: Semi-supervised video object segmentation (VOS) aims to densely track certain designated objects in videos. One of the main challenges in this task is the existence of background distractors that appear similar to the target objects. We propose three novel strategies to suppress such distractors: 1) a spatio-temporally diversified template construction scheme to obtain generalized properties of the target objects; 2) a learnable distance-scoring function to exclude spatially-distant distractors by exploiting t…
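The abstract's second strategy names a learnable distance-scoring function that suppresses matches to spatially distant pixels between consecutive frames. The sketch below is purely illustrative and not the paper's design: the MLP form, the sigmoid squashing, and all names are assumptions. It maps pairwise pixel distances to multiplicative penalties on a matching affinity.

import torch
import torch.nn as nn

class DistanceScore(nn.Module):
    # Hypothetical learnable distance-scoring function: map the normalized
    # spatial distance between a query pixel and a previous-frame pixel to a
    # penalty in [0, 1], so matches to far-away (likely distractor) pixels
    # are down-weighted. The 2-layer MLP and the sigmoid are assumptions.
    def __init__(self, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, 1))

    def forward(self, pos_query, pos_prev):
        # pos_query: (Nq, 2), pos_prev: (Np, 2), coordinates scaled to [0, 1].
        dist = torch.cdist(pos_query, pos_prev)            # (Nq, Np)
        return torch.sigmoid(self.mlp(dist.unsqueeze(-1))).squeeze(-1)

The returned (Nq, Np) scores would be multiplied element-wise into a pixel-matching affinity before taking per-query maxima.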

Cited by 2 publications (6 citation statements) | References 26 publications
“…We quantitatively compare our method to existing two-stage methods in Table 1. For the VOS models, we adopt FRTM (Robinson et al. 2020), CFBI (Yang, Wei, and Yang 2020), BMVOS (Cho et al. 2022a), and TBD (Cho et al. 2022b). For the VI models, CPNet (Lee et al. 2019), STTN (Zeng, Fu, and Chao 2020), FGVC (Gao et al. 2020), and FuseFormer (Liu et al. 2021) are used.…”
Section: Quantitative Results
confidence: 99%
“…We adopt two datasets for network training, COCO (Lin et al. 2014) and YouTube-VOS 2018 (Xu et al. 2018). As COCO is an image dataset, we randomly augment each image to generate videos, following the protocol in STM (Oh et al. 2019a) and TBD (Cho et al. 2022b). We resize all training videos to a 240 × 432 resolution and use them as clean videos.…”
Section: Network Training
confidence: 99%
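As a minimal sketch of that image-to-video protocol, assuming independent random affine jitter per frame in the spirit of STM (the function name and all jitter ranges are illustrative, not taken from either paper):

import torch
import torchvision.transforms.functional as TF

def image_to_pseudo_video(image, mask, num_frames=3, size=(240, 432)):
    # Turn one static image (C, H, W) and its object mask (1, H, W) into a
    # short pseudo-video by applying an independent random affine warp per
    # frame. The default affine interpolation (nearest) keeps the mask binary.
    frames, masks = [], []
    for _ in range(num_frames):
        angle = float(torch.empty(1).uniform_(-20, 20))           # degrees
        translate = [int(torch.empty(1).uniform_(-30, 30)) for _ in range(2)]
        scale = float(torch.empty(1).uniform_(0.9, 1.1))
        shear = [float(torch.empty(1).uniform_(-10, 10)), 0.0]
        frames.append(TF.affine(image, angle, translate, scale, shear))
        masks.append(TF.affine(mask, angle, translate, scale, shear))
    # Resize to the 240 x 432 training resolution quoted in the statement;
    # nearest-neighbour resizing preserves the binary mask values.
    frames = [TF.resize(f, list(size), antialias=True) for f in frames]
    masks = [TF.resize(m, list(size),
                       interpolation=TF.InterpolationMode.NEAREST)
             for m in masks]
    return torch.stack(frames), torch.stack(masks)  # (T, C, H, W), (T, 1, H, W)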
“…Considering the lack of details when only employing high-level feature matching, HMMN [36] proposes a novel hierarchical matching mechanism to capture small objects as well. To relieve potential errors that can be caused by employing a pixel-level template, AOC [46] employs an adaptive proxy-level template, and TBD [6] employs both pixel-level and object-level templates simultaneously.…”
Section: Related Work
confidence: 99%
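To make the pixel-level versus object-level distinction concrete, here is a generic sketch under our own assumptions (cosine similarity, mask pooling; it reproduces neither AOC's nor TBD's exact matching): a pixel-level template keeps every foreground pixel embedding, while an object-level template pools them into a single vector.

import torch
import torch.nn.functional as F

def pixel_level_matching(query, template, template_mask):
    # query, template: (C, H, W) frame features; template_mask: (H, W) binary.
    # Each query pixel scores as its best cosine match among the template's
    # foreground pixels: detailed, but sensitive to single noisy pixels.
    C, H, W = query.shape
    q = F.normalize(query.reshape(C, -1), dim=0)          # (C, HW)
    t = F.normalize(template.reshape(C, -1), dim=0)       # (C, HW)
    sim = t.T @ q                                         # (HW_t, HW_q)
    fg = template_mask.reshape(-1).bool()
    return sim[fg].max(dim=0).values.reshape(H, W)

def object_level_matching(query, template, template_mask):
    # Match against one mask-pooled object embedding: coarser, but robust
    # to pixel-level errors in the template.
    C, H, W = query.shape
    m = template_mask.reshape(1, -1).float()
    obj = (template.reshape(C, -1) * m).sum(-1) / m.sum().clamp(min=1)
    obj = F.normalize(obj, dim=0)                         # (C,)
    q = F.normalize(query.reshape(C, -1), dim=0)
    return (obj @ q).reshape(H, W)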
“…Decoder. The decoder is designed identically to TBD [6]. It consists of convolutional layers that fuse and refine different features, and deconvolutional layers [49] that upscale the refined features.…”
Section: Implementation Details
confidence: 99%
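As a rough illustration of that fuse-refine-upscale pattern (channel widths, the stage count, and the additive skip are our assumptions, not TBD's actual configuration):

import torch
import torch.nn as nn

class Decoder(nn.Module):
    # Generic decoder: at each stage a convolution fuses a backbone feature,
    # a convolution refines the running state, and a transposed convolution
    # ("deconvolution") doubles the spatial resolution.
    def __init__(self, in_channels=(1024, 512, 256), mid=256):
        super().__init__()
        self.fuse = nn.ModuleList(
            nn.Conv2d(c, mid, 3, padding=1) for c in in_channels)
        self.refine = nn.ModuleList(
            nn.Conv2d(mid, mid, 3, padding=1) for _ in in_channels)
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(mid, mid, 4, stride=2, padding=1)
            for _ in in_channels)
        self.head = nn.Conv2d(mid, 2, 3, padding=1)  # foreground/background logits
        self.act = nn.ReLU(inplace=True)

    def forward(self, feats):
        # feats: coarse-to-fine backbone features, e.g. at strides 16, 8, 4.
        x = None
        for f, fuse, refine, up in zip(feats, self.fuse, self.refine, self.up):
            y = self.act(fuse(f))
            x = y if x is None else x + y    # fuse skip with the upscaled path
            x = self.act(up(self.act(refine(x))))
        return self.head(x)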