Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning

Chen, Yuhua; Pont-Tuset, Jordi; Montes, Alberto; Van Gool, Luc

doi:10.1109/cvpr.2018.00130

Cited by 289 publications

(233 citation statements)

References 51 publications

Supporting

Mentioning

233

Contrasting

“…It has created a large number of synthetic video training data from Pascal VOC [11,12], ECSSD [49] and MSRA10K [7] DAVIS 2017 benchmark, we exclude PReMVOS [38] and OSVOS+ [39] as they both use multiple specialized networks in multiple processes to refine their results. For DAVIS 2016, we compare with OnAVOS [52], FAVOS [5], OSVOS [3], MSK [42], PML [4], SFL [6], OSMN [57], CTN [27] and VPN [26]. We detect multiple objects and evaluate in the way for single-object.…”

Section: Compare With Other Methods (mentioning)

confidence: 99%

LIP: Learning Instance Propagation for Video Object Segmentation

Lyu, Vosselman, Xia et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

In recent years, the task of segmenting foreground objects from background in a video, i.e. video object segmentation (VOS), has received considerable attention. In this paper, we propose a single end-to-end trainable deep neural network, convolutional gated recurrent Mask-RCNN, for tackling the semi-supervised VOS task. We take advantage of both the instance segmentation network (Mask-RCNN) and the visual memory module (Conv-GRU) to tackle the VOS task. The instance segmentation network predicts masks for instances, while the visual memory module learns to selectively propagate information for multiple instances simultaneously, which handles the appearance change, the variation of scale and pose and the occlusions between objects. After offline and online training under purely instance segmentation losses, our approach is able to achieve satisfactory results without any post-processing or synthetic video data augmentation. Experimental results on DAVIS 2016 dataset and DAVIS 2017 dataset have demonstrated the effectiveness of our method for video object segmentation task.

“…These faster semi-supervised approaches come in many flavours. For instance, Chen et al [7] learn a metric space for pixel embeddings, which is then used to establish associations between pixels across frames, while Cheng et al [8] suggest to individually track object parts from the first frame with a visual object tracker [2] and then aggregate them according to their similarity with the initialisation mask.…”

Section: Related Work (mentioning)

confidence: 99%
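The idea in the statement above, associating pixels across frames through distances in an embedding space, can be illustrated with a toy sketch. This is not the cited authors' implementation: it assumes the per-pixel embeddings have already been produced by some network, uses hand-written 2-D vectors in their place, and the function name `segment_by_nearest_neighbor` is hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def segment_by_nearest_neighbor(ref_embeddings, ref_labels, query_embeddings):
    """Give each query pixel the label of its closest reference pixel.

    ref_embeddings:   embedding vectors of annotated first-frame pixels
    ref_labels:       0 (background) or 1 (foreground) for each reference pixel
    query_embeddings: embedding vectors of pixels in a later frame
    """
    labels = []
    for q in query_embeddings:
        # Index of the reference pixel whose embedding is nearest to q.
        nearest = min(
            range(len(ref_embeddings)),
            key=lambda i: dist(q, ref_embeddings[i]),
        )
        labels.append(ref_labels[nearest])
    return labels


# Toy data (assumed, for illustration only): two background and two
# foreground reference pixels, plus three pixels from a later frame.
ref_embeddings = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
ref_labels = [0, 0, 1, 1]
query_embeddings = [(0.05, 0.1), (0.95, 0.9), (0.2, 0.1)]

predicted = segment_by_nearest_neighbor(ref_embeddings, ref_labels, query_embeddings)
print(predicted)
```

A brute-force search like this costs one distance computation per reference pixel for every query pixel, so a practical system would need a faster lookup; the sketch only shows the association step.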

“…PReMVOS [40]: 84.9, 88.6, -; OSVOS [3]: 79.8, 80.6, -; MSK [50]: 79.7, 75.4, -; PML [7]: 75.5, 79.3, -; SFL [9]: 76.1, 76.0, -; VPN [52]: 70. 2 pruning.…”

Section: Comparison With the State Of The Art (mentioning)

confidence: 99%

“…In the video object segmentation community, the deterioration of performance over time in unsupervised VOS methods based on optical flow or RNNs is well known and has been widely discussed [7,33,46,62]. For instance, Li et al [33] demonstrate that, as a regular optical flow-based model progresses through frames, foreground embeddings become increasingly closer in feature space to the first frame's background as opposed to the foreground.…”

Section: Introduction (mentioning)

confidence: 99%

Anchor Diffusion for Unsupervised Video Object Segmentation

Zhao, Wang, Bertinetto et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Unsupervised video object segmentation has often been tackled by methods based on recurrent neural networks and optical flow. Despite their complexity, these kinds of approaches tend to favour short-term temporal dependencies and are thus prone to accumulating inaccuracies, which cause drift over time. Moreover, simple (static) image segmentation models, alone, can perform competitively against these methods, which further suggests that the way temporal dependencies are modelled should be reconsidered. Motivated by these observations, in this paper we explore simple yet effective strategies to model long-term temporal dependencies. Inspired by the non-local operators of [70], we introduce a technique to establish dense correspondences between pixel embeddings of a reference "anchor" frame and the current one. This allows the learning of pairwise dependencies at arbitrarily long distances without conditioning on intermediate frames. Without online supervision, our approach can suppress the background and precisely segment the foreground object even in challenging scenarios, while maintaining consistent performance over time. With a mean IoU of 81.7%, our method ranks first on the DAVIS-2016 leaderboard of unsupervised methods, while still being competitive against state-of-the-art online semi-supervised approaches. We further evaluate our method on the FBMS dataset and the ViSal video saliency dataset, showing results competitive with the state of the art.
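The mean IoU figure quoted in the abstract above is an intersection-over-union score averaged over frames. The sketch below shows that computation on toy binary masks. It is an illustration only, not the benchmark's official evaluation code: the helper names `iou` and `mean_iou` are hypothetical, masks are flattened lists of 0/1 values, and treating two empty masks as a perfect match is an assumed convention.

```python
def iou(pred, gt):
    """Intersection over union of two flattened binary masks of equal length."""
    intersection = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pred_frames, gt_frames):
    """Average the per-frame IoU over a sequence of frames."""
    scores = [iou(p, g) for p, g in zip(pred_frames, gt_frames)]
    return sum(scores) / len(scores)


# Toy sequence of two 4-pixel frames (made-up values for illustration).
pred_frames = [[1, 1, 0, 0], [0, 1, 1, 1]]
gt_frames = [[1, 0, 0, 0], [0, 1, 1, 1]]

score = mean_iou(pred_frames, gt_frames)
print(score)
```

In the first frame one pixel overlaps out of two marked in either mask, and the second frame matches exactly, so the two per-frame scores are averaged into a single sequence score.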

“…Matching or propagation based methods have also been proposed for VOS. Matching based methods [8,19] segment pixels according to the pixel-level matching scores between the features of the first frame and of each subsequent frame (Fig. 1 (a)), while propagation based methods [9,10,38,40,54,59] mainly rely on temporally deforming the annotated mask of the first frame via predictions of the previous frame [40] (Fig.…”

Section: Introduction (mentioning)

confidence: 99%

RANet: Ranking Attention Network for Fast Video Object Segmentation

Wang, Xu, Liu et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Although online learning (OL) techniques have boosted the performance of semi-supervised video object segmentation (VOS) methods, the huge time costs of OL greatly restrict their practicality. Matching based and propagation based methods run faster by avoiding OL techniques. However, they are limited by sub-optimal accuracy, due to mismatching and drifting problems. In this paper, we develop a real-time yet very accurate Ranking Attention Network (RANet) for VOS. Specifically, to integrate the insights of matching based and propagation based methods, we employ an encoder-decoder framework to learn pixel-level similarity and segmentation in an end-to-end manner. To better utilize the similarity maps, we propose a novel ranking attention module, which automatically ranks and selects these maps for fine-grained VOS performance. Experiments on DAVIS 16 and DAVIS 17 datasets show that our RANet achieves the best speed-accuracy trade-off, e.g., with 33 milliseconds per frame and J&F=85.5% on DAVIS 16. With OL, our RANet reaches J&F=87.1% on DAVIS 16.
