Discovering Class-Specific Pixels for Weakly-Supervised Semantic Segmentation

Kim

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

et al. 2019

453

359

The main obstacle to weakly supervised semantic image segmentation is the difficulty of obtaining pixel-level information from coarse image-level annotations. Most methods based on image-level annotations use localization maps obtained from the classifier, but these only focus on the small discriminative parts of objects and do not capture precise boundaries. FickleNet explores diverse combinations of locations on feature maps created by generic deep neural networks. It selects hidden units randomly and then uses them to obtain activation scores for image classification. FickleNet implicitly learns the coherence of each location in the feature maps, resulting in a localization map which identifies both discriminative and other parts of objects. The ensemble effects are obtained from a single network by selecting random hidden unit pairs, which means that a variety of localization maps are generated from a single image. Our approach does not require any additional training steps and only adds a simple layer to a standard convolutional neural network; nevertheless it outperforms recent comparable techniques on the Pascal VOC 2012 benchmark in both weakly and semi-supervised settings.
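The idea of drawing several localization maps from one image by randomly selecting hidden units can be illustrated with a toy sketch. This is not the authors' implementation: the per-location `evidence` grid, the function names, the average-pooled score, and the element-wise maximum used to combine samples are all assumptions made for illustration, and a trained network (which is where the method's benefit comes from) is replaced by a fixed grid of non-negative values.

```python
import random


def sample_mask(height, width, keep_prob, rng):
    """Randomly keep each spatial location with probability keep_prob."""
    return [[1 if rng.random() < keep_prob else 0 for _ in range(width)]
            for _ in range(height)]


def stochastic_localization(evidence, num_samples=8, keep_prob=0.5, seed=0):
    """Toy stand-in for stochastic inference on a single image.

    evidence: H x W grid of non-negative class evidence (hypothetical input).
    Returns the element-wise maximum over all sampled maps and the
    classification score obtained from each random selection.
    """
    rng = random.Random(seed)
    height, width = len(evidence), len(evidence[0])
    aggregated = [[0.0] * width for _ in range(height)]
    scores = []
    for _ in range(num_samples):
        mask = sample_mask(height, width, keep_prob, rng)
        # Only the randomly selected locations contribute to this sample.
        kept = [[evidence[y][x] * mask[y][x] for x in range(width)]
                for y in range(height)]
        n_kept = sum(sum(row) for row in mask)
        # Assumed score: average evidence over the kept locations.
        scores.append(sum(sum(row) for row in kept) / n_kept if n_kept else 0.0)
        # Assumed aggregation rule: keep the strongest response seen so far.
        for y in range(height):
            for x in range(width):
                aggregated[y][x] = max(aggregated[y][x], kept[y][x])
    return aggregated, scores
```

Each random selection yields a different map and a different score from the same input, which is the single-network ensemble effect the abstract describes; how the maps are actually computed and combined in FickleNet should be taken from the paper itself.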

Section: Comparison To the State Of The Art (mentioning)

confidence: 99%

FickleNet: Weakly and Semi-Supervised Semantic Image Segmentation Using Stochastic Inference

Kim

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

et al. 2019

“…A group of approaches take the class activation maps (CAMs) [11] generated from classification networks as initial seeds. Since CAMs only focus on small discriminative regions which are too sparse to effectively supervise a segmentation model, various techniques such as adversarial erasing [12], [17], [21], [18] and region growing [13], [22] have been developed to expand sparse object seeds. Another research line introduces dilated convolutions of different rates [14], [16], [15], [23] to enlarge receptive fields in classification networks and aggregate multiple attention maps to achieve dense localization cues.…”

Section: A Weakly-supervised Semantic Segmentation (mentioning)

confidence: 99%

“…• In contrast to existing WSSS methods [18], [19], [20] that directly combine class-agnostic saliency maps with class-specific attention maps in user-defined ways, our approach fuses these two cues adaptively via the learning of the proposed self-attention network.…”

Section: Introduction (mentioning)

confidence: 99%

Saliency Guided Self-Attention Network for Weakly and Semi-Supervised Semantic Segmentation

Yao

Gong

2020

IEEE Access

Weakly supervised semantic segmentation (WSSS) using only image-level labels can greatly reduce the annotation cost and therefore has attracted considerable research interest. However, its performance is still inferior to that of its fully supervised counterparts. To mitigate the performance gap, we propose a saliency guided self-attention network (SGAN) to address the WSSS problem. The introduced self-attention mechanism is able to capture rich and extensive contextual information but may also mis-spread attention to unexpected regions. To enable this mechanism to work effectively under weak supervision, we integrate class-agnostic saliency priors into the self-attention mechanism to prevent the attention on discriminative parts from mis-spreading to the background. Meanwhile, we utilize class-specific attention cues as additional supervision for SGAN, which reduces the mis-spread of attention in regions belonging to different foreground categories. The proposed approach is able to produce dense and accurate localization cues, by which the segmentation performance is boosted. Experiments on the PASCAL VOC 2012 dataset show that the proposed approach outperforms all other state-of-the-art methods.

“…Our method achieves mIoU values of 63.9 and 65.0 for PASCAL VOC 2012 validation and test images respectively, which is 94.4% of that of DeepLab [3], trained with fully annotated data, which achieved an mIoU of 67.6 on validation images. Our method is 3.1% better on test images than the best method which uses only image-level annotations for supervision.…”

Table fragment embedded in the same passage (mIoU on PASCAL VOC 2012 validation and test images):

Method        Venue       Ref    Val    Test
…             …           [45]   49.8   51.2
TransferNet   CVPR '16    [11]   52.1   51.2
AISI          ECCV '18    [16]   61.3   62.
…             …           [33]   52.8   53.7
TPL           ICCV '17    [22]   53.1   53.8
AE_PSL        CVPR '17    [44]   55.0   55.7
DCSP          BMVC '17    [2]    58.6   59.2
MEFF          CVPR '18    [9]    -      55.6
GAIN          CVPR '18    [26]   55.3   56.8
MCOF          CVPR '18    [43]   56.2   57.6
AffinityNet   CVPR '18    [1]    58.4   60.5
DSRG          CVPR '18    [17]   59.0   60.4
MDC           CVPR '18    [46]   60.4   60.8
SeeNet        NIPS '18    [15]   61.1   60.7
FickleNet     CVPR '19    [24]   61.2   61.9
Ours                             63.9   65.0

Section: Results On Image Segmentation (mentioning)

confidence: 99%
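The scores compared in the quoted passage are mean intersection-over-union (mIoU) values. As a reminder of what that metric measures, here is a minimal sketch that computes it from a confusion matrix. The function name and the convention of skipping classes that never appear are assumptions for illustration; this is not the official PASCAL VOC evaluation code.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] = number of pixels whose true class is i and whose
    predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        # Assumed convention: a class absent from both the labels and the
        # predictions is left out of the average.
        if union > 0:
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, `mean_iou([[3, 1], [1, 3]])` gives an IoU of 3/5 for each of the two classes and therefore an mIoU of 0.6.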

Frame-to-Frame Aggregation of Active Regions in Web Videos for Weakly Supervised Semantic Segmentation

Kim

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

et al. 2019

When a deep neural network is trained on data with only image-level labeling, the regions activated in each image tend to identify only a small region of the target object. We propose a method of using videos automatically harvested from the web to identify a larger region of the target object by using temporal information, which is not present in the static image. The temporal variations in a video allow different regions of the target object to be activated. We obtain an activated region in each frame of a video, and then aggregate the regions from successive frames into a single image, using a warping technique based on optical flow. The resulting localization maps cover more of the target object, and can then be used as proxy ground-truth to train a segmentation network. This simple approach outperforms existing methods under the same level of supervision, and even approaches relying on extra annotations. Based on VGG-16 and ResNet 101 backbones, our method achieves the mIoU of 65.0 and 67.4, respectively, on PASCAL VOC 2012 test images, which represents a new state-of-the-art.
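The aggregation step described above (warp the activated regions of neighbouring frames onto one frame, then combine them) can be sketched in a few lines. This is a simplified illustration rather than the paper's method: it assumes integer-valued backward flow, nearest-pixel lookup, and an element-wise maximum as the combination rule, and all function names are hypothetical.

```python
def warp_to_reference(activation, flow):
    """Backward-warp a neighbouring frame's activation map onto the reference.

    activation: H x W grid of activation values from the neighbouring frame.
    flow[y][x] = (dy, dx): integer offset from reference pixel (y, x) to the
    matching pixel in the neighbouring frame (assumed to be given).
    """
    height, width = len(activation), len(activation[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = flow[y][x]
            src_y, src_x = y + dy, x + dx
            # Pixels whose match falls outside the frame stay at zero.
            if 0 <= src_y < height and 0 <= src_x < width:
                warped[y][x] = activation[src_y][src_x]
    return warped


def aggregate_max(reference_activation, warped_maps):
    """Combine the reference map with the warped maps (assumed rule: max)."""
    maps = [reference_activation] + list(warped_maps)
    height, width = len(reference_activation), len(reference_activation[0])
    return [[max(m[y][x] for m in maps) for x in range(width)]
            for y in range(height)]
```

In a one-row example, a reference map `[[1.0, 0.0, 0.0]]` and a neighbouring map `[[0.0, 0.0, 1.0]]` whose content has shifted one pixel to the right (flow `(0, 1)` everywhere) combine to `[[1.0, 1.0, 0.0]]`, so the aggregated map covers more of the object than either frame alone.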