Weakly Supervised Semantic Segmentation using Web-Crawled Videos

Kim

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

et al. 2019

453

359

The main obstacle to weakly supervised semantic image segmentation is the difficulty of obtaining pixel-level information from coarse image-level annotations. Most methods based on image-level annotations use localization maps obtained from the classifier, but these only focus on the small discriminative parts of objects and do not capture precise boundaries. FickleNet explores diverse combinations of locations on feature maps created by generic deep neural networks. It selects hidden units randomly and then uses them to obtain activation scores for image classification. FickleNet implicitly learns the coherence of each location in the feature maps, resulting in a localization map which identifies both discriminative and other parts of objects. The ensemble effects are obtained from a single network by selecting random hidden unit pairs, which means that a variety of localization maps are generated from a single image. Our approach does not require any additional training steps and only adds a simple layer to a standard convolutional neural network; nevertheless it outperforms recent comparable techniques on the Pascal VOC 2012 benchmark in both weakly and semi-supervised settings.
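The idea of drawing several random selections of hidden units from one feature map, and combining the resulting maps, can be illustrated with a toy sketch. This is not the authors' implementation: it does not reproduce the network, the training, or the way the paper derives its localization maps. The function names, the mean-based score, and the element-wise maximum used for aggregation are all assumptions made for illustration.

```python
import random


def random_unit_mask(height, width, keep_prob, rng):
    """Keep each spatial location independently with probability keep_prob."""
    return [[1 if rng.random() < keep_prob else 0 for _ in range(width)]
            for _ in range(height)]


def class_score(feature_map, mask):
    """Mean activation over the kept locations (hypothetical stand-in for a classifier head)."""
    kept = [feature_map[i][j]
            for i in range(len(mask))
            for j in range(len(mask[0]))
            if mask[i][j]]
    return sum(kept) / len(kept) if kept else 0.0


def aggregate_localization(feature_map, keep_prob=0.5, num_samples=20, seed=0):
    """Combine masked activation maps from several random draws.

    Assumption: maps are merged with an element-wise maximum, so a location
    is highlighted if any random selection kept it with a high activation.
    Activations are assumed to be non-negative.
    """
    rng = random.Random(seed)
    height, width = len(feature_map), len(feature_map[0])
    aggregated = [[0.0] * width for _ in range(height)]
    scores = []
    for _ in range(num_samples):
        mask = random_unit_mask(height, width, keep_prob, rng)
        scores.append(class_score(feature_map, mask))
        for i in range(height):
            for j in range(width):
                aggregated[i][j] = max(aggregated[i][j],
                                       feature_map[i][j] * mask[i][j])
    return aggregated, scores


if __name__ == "__main__":
    toy_map = [[0.1, 0.9], [0.4, 0.2]]  # made-up activations for one class
    agg, scores = aggregate_localization(toy_map, keep_prob=0.5, num_samples=5)
    print(agg)
    print(scores)
```

Each draw yields a different score and a different masked map from the same input, which is the single-network ensemble effect the abstract describes, here in a heavily simplified form.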

Section: Training Val Test (mentioning)

confidence: 98%

“…Supervision: Image-level and additional annotations
Method                    Training  Val   Test
MIL-seg CVPR '15 [23]     700K      42.0  40.6
STC TPAMI '17 [32]        50K       49.8  51.2
TransferNet CVPR '16 [9]  70K       52.1  51.2
CrawlSeg CVPR '17 [10]    970K      58.1  58.7
AISI ECCV '18 [11]        11K       61.3  62.1…”

Section: Training Val Test (mentioning)

confidence: 99%

FickleNet: Weakly and Semi-Supervised Semantic Image Segmentation Using Stochastic Inference


2017 IEEE International Conference on Computer Vision (ICCV)

“…In this context [16,50] work in the even more constrained scenario, where only two classes are considered: foreground vs. background. By contrast, to differentiate multiple foreground classes, but still assuming a single background, [35] relied on motion cues and [17] made use of a huge amount of web-crawled data (4606 videos with 960,517 frames).…”

Section: Related Work (mentioning)

confidence: 99%

Bringing Background into the Foreground: Making All Classes Equal in Weakly-Supervised Video Semantic Segmentation

Saleh

Aliakbarian

Salzmann

et al. 2017

Pixel-level annotations are expensive and time-consuming to obtain. Hence, weak supervision using only image tags could have a significant impact in semantic segmentation. Recent years have seen great progress in weakly-supervised semantic segmentation, whether from a single image or from videos. However, most existing methods are designed to handle a single background class. In practical applications, such as autonomous navigation, it is often crucial to reason about multiple background classes. In this paper, we introduce an approach to doing so by making use of classifier heatmaps. We then develop a two-stream deep architecture that jointly leverages appearance and motion, and design a loss based on our heatmaps to train it. Our experiments demonstrate the benefits of our classifier heatmaps and of our two-stream architecture on challenging urban scene datasets and on the YouTube-Objects benchmark, where we obtain state-of-the-art results.

IEEE Trans. on Image Process.

“…Weak Supervision. Weakly supervised learning has been extensively used for various problems in computer vision such as semantic segmentation [73,74,75,76,77,78], object localization [79,80,81,82], saliency detection [83,84], scene recognition [85,86] and many more. However, this form of learning has been relatively unexplored for crowd counting.…”

Section: Related Work (mentioning)

confidence: 99%

HA-CCN: Hierarchical Attention-Based Crowd Counting Network

Sindagi

Patel

2020

184

Single image-based crowd counting has recently witnessed increased focus, but many leading methods are far from optimal, especially in highly congested scenes. In this paper, we present the Hierarchical Attention-based Crowd Counting Network (HA-CCN), which employs attention mechanisms at various levels to selectively enhance the features of the network. The proposed method, which is based on the VGG16 network, consists of a spatial attention module (SAM) and a set of global attention modules (GAM). SAM enhances low-level features in the network by infusing spatial segmentation information, whereas the GAM focuses on enhancing channel-wise information in the higher-level layers. The proposed method is a single-step training framework, is simple to implement, and achieves state-of-the-art results on different datasets. Furthermore, we extend the proposed counting network by introducing a novel set-up to adapt the network to different scenes and datasets via weak supervision using image-level labels. This new set-up reduces the burden of acquiring labour-intensive point-wise annotations for new datasets while improving the cross-dataset performance.