Background Activation Suppression for Weakly Supervised Object Localization

Wu, Ping-Yu; Zhai, Wei; Cao, Yang

doi:10.1109/cvpr52688.2022.01385

Cited by 30 publications

(21 citation statements)

References 20 publications

“…More concretely, if the backbone is VGG16 [45], KD-CI-CAM achieves 79.2% Top-1 classification accuracy that is 1.9% higher than the current SOTA FAM [32] and outperforms it by 3.7% and 2.3% in the Top-1 localization accuracy and GT-known localization accuracy, respectively. Besides, KD-CI-CAM reaches 73.0% Top-1 localization accuracy that is 1.7% higher than the current SOTA BAS [56] and outperforms it by 0.5% in the GT-known localization accuracy. Compared with the GT-known localization SOTA BridgeGap [22], KD-CI-CAM is in a narrow margin that 1.6% lower for the GT-known localization accuracy, but it brings a significant performance gain of 2.2% over BridgeGap [22] in the Top-1 localization accuracy.…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 82%
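
The statement above compares Top-1 localization and GT-known localization accuracy. As a rough illustration of how these two metrics are commonly defined in the WSOL literature, the sketch below counts a box as correct when its IoU with the ground-truth box is at least 0.5, and additionally requires a correct class prediction for Top-1 localization. The function names, the sample dictionary layout, and the use of a single predicted box per image are assumptions made for this example, not details taken from the cited papers.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracies(samples, iou_threshold=0.5):
    """Return (top1_loc, gt_known_loc) as fractions in [0, 1].

    Each sample is a dict with the hypothetical keys
    'pred_label', 'true_label', 'pred_box', and 'true_box'.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    top1_hits = 0
    gt_known_hits = 0
    for s in samples:
        box_ok = iou(s["pred_box"], s["true_box"]) >= iou_threshold
        if box_ok:
            # GT-known: only the box has to be right.
            gt_known_hits += 1
            # Top-1: the box and the predicted class both have to be right.
            if s["pred_label"] == s["true_label"]:
                top1_hits += 1
    n = len(samples)
    return top1_hits / n, gt_known_hits / n
```

Because Top-1 localization adds the classification requirement, it can never exceed GT-known localization under this definition, which is why the two numbers are reported separately.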

“…We summarise this issue as a classification-localization dilemma ("C-L dilemma" for short). We argue that these two problems severely hinder the WSOL performance and heretofore yet to be well studied, despite the existence of a vast body of WSOL literature [10,31,52,56,67,69,72].…”

Section: Introduction (mentioning)

confidence: 94%

“…The hyper-parameters of classification and localization teachers are consistent with the student model except for µ = 0.6, η = 0.0, β = 0.2, and δ = 2e − 08. In the testing phase, we first resize images to 500 × 500 and then centrally crop it to 299 × 299 inspired by [52,56,65]. Then, we generate the bounding box by segmenting the localization map using a threshold θ = 0.21.…”

Section: Implementation Details (mentioning)

confidence: 99%
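
The statement above describes turning a localization map into a bounding box by thresholding it at θ = 0.21. A minimal sketch of that step follows. It assumes the map is min-max normalized to [0, 1] before thresholding and that the box tightly encloses every above-threshold pixel; many implementations keep only the largest connected region instead, and the quoted text does not say which variant is used. The function name and the list-of-lists map layout are hypothetical.

```python
def box_from_localization_map(loc_map, theta=0.21):
    """Return (x_min, y_min, x_max, y_max) in pixel indices, or None.

    loc_map is a 2-D list of floats (rows of equal length).
    Assumption: the map is min-max normalized before thresholding.
    """
    flat = [v for row in loc_map for v in row]
    lo, hi = min(flat), max(flat)
    scale = hi - lo
    if scale == 0:
        # A constant map carries no localization signal.
        return None
    xs, ys = [], []
    for y, row in enumerate(loc_map):
        for x, v in enumerate(row):
            if (v - lo) / scale >= theta:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    # Simplification: one tight box around all foreground pixels.
    return (min(xs), min(ys), max(xs), max(ys))
```

A lower θ admits more of the map and yields a larger box, while a higher θ shrinks the box toward the most strongly activated region.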

“…Forcing a classification model to pay more attention to the unrepresentative area (e.g., the fur of an animal) conceding to the integral contour perception would inevitably cause a biased categorical prediction, and vice versa for a localization model. Prior approaches sidestep such a dilemma by simply trading off the classification and localization performances, i.e., choosing a mutually acceptable model or only reporting localization results [22,35,56,59]. However, the intrinsic issue behind this dilemma remains under-explored.…”

Section: Introduction (mentioning)

confidence: 99%

Improving Weakly Supervised Object Localization via Causal Intervention

Shao, Luo, Zhang et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

The recently emerged weakly-supervised object localization (WSOL) methods can learn to localize an object in the image only using image-level labels. Previous works endeavor to perceive the interval objects from the small and sparse discriminative attention map, yet ignoring the co-occurrence confounder (e.g., duck and water), which makes the model inspection (e.g., CAM) hard to distinguish between the object and context. In this paper, we make an early attempt to tackle this challenge via causal intervention (CI). Our proposed method, dubbed CI-CAM, explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps, thus improving the accuracy of object localization. Extensive experiments on several benchmarks demonstrate the effectiveness of CI-CAM in learning the clear object boundary from confounding contexts. Particularly, on the CUB-200-2011 which severely suffers from the co-occurrence confounder, CI-CAM significantly outperforms the traditional CAM-based baseline (58.39% vs 52.4% in Top-1 localization accuracy). While in more general scenarios such as ILSVRC 2016, CI-CAM can also perform on par with the state of the art.

“…To relieve the laborious annotations, extensive efforts have been made to address semantic segmentation with less supervision. In the family of weakly-supervised object localization [24,49,53] and semantic segmentation [1,9,51], only class labels are available for supervision. Generally, the class activation maps [57] derived from the classification network serve as the initial segmentation results.…”

Section: Related Work (mentioning)

confidence: 99%
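
The statement above notes that class activation maps derived from a classification network serve as initial segmentation results. As a hedged illustration of the standard CAM construction (a per-class weighted sum of the final convolutional feature maps, using the classifier weights for that class), the sketch below uses plain Python lists. The function name and the data layout are assumptions for this example; real implementations operate on framework tensors.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps for one class.

    feature_maps: list of K maps, each a 2-D list of floats (H x W).
    class_weights: list of K floats, the classifier weights that
    connect each pooled feature channel to the target class.
    """
    if len(feature_maps) != len(class_weights):
        raise ValueError("need exactly one weight per feature map")
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += w * fmap[y][x]
    return cam
```

The resulting map is typically upsampled to the input resolution and thresholded to obtain a coarse object mask or bounding box.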

Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

Xu, Hou, Zhang et al. 2023

Preprint

In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed as OVSegmentor, which only exploits web-crawled image-text pairs for pre-training without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, that enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct CC4M dataset for pre-training by filtering CC12M with frequently appeared entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets, PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model achieves superior segmentation results over the state-of-the-art method by using only 3% data (4M vs 134M) for pre-training. Code and pre-trained models will be released for future research.
