Black-box Explanation of Object Detectors via Saliency Maps

Petsiuk, Vitali; Jain, Rajiv; Manjunatha, Varun; Morariu, Vlad I.; Mehra, Ashutosh; Ordóñez, Vicente; Saenko, Kate

doi:10.1109/cvpr46437.2021.01128

Cited by 147 publications

(208 citation statements)

References 37 publications

Supporting

Mentioning

138

Contrasting

“…The size-constraint approach in (Kervadec et al, 2019b) achieved similar low performance because the use of the presence and non-presence constraints does not impose any upper bound on the size of the regions of interest, which, typically, results in the activation of large regions.

[Table rows fused into the excerpt; the column headers and the first row's method name are not recoverable, and the last row is cut off]
(Wei et al, 2017): 7.50, 65.60, 25.01
CAM-Max (Oquab et al, 2015): 1.25, 66.00, 26.32
CAM-LSE (Pinheiro and Collobert, 2015; Sun et al, 2016): 1.25, 66.05, 27.93
Grad-CAM (Selvaraju et al, 2017): 0.00, 66.30, 21.30
CAM-Avg (Zhou et al, 2016): 0.00, 66.90, 17.88
Wildcat (Durand et al, 2017): 1

In terms of image classification performance (Table 1, first column), the proposed method obtains the lowest classification error, similarly to other methods such as CAM-avg (Zhou et al, 2016) and Grad-CAM (Selvaraju et al, 2017). It is noteworthy to mention that, despite providing the best segmentation results, U-Net cannot simultaneously provide image classification predictions.…”

Section: Results (mentioning)

confidence: 99%

“…Pinpointing image sub-regions that were used by the model to make its global image-class prediction not only provides weakly supervised segmentation, but also enables interpretable deep-network classifiers. It is worth noting that such interpretability aspects are also attracting wide interest in computer vision (Bach et al, 2015; Bau et al, 2017; Bhatt et al, 2020; Dabkowski and Gal, 2017; Escalante et al, 2018; Fong et al, 2019; Fong and Vedaldi, 2017; Goh et al, 2020; Osman et al, 2020; Murdoch et al, 2019; Petsiuk et al, 2020; 2018; Ribeiro et al, 2016; Samek et al, 2020; Zhang et al, 2020; Belharbi et al, 2021) and medical imaging (de La Torre et al, 2020; Gondal et al, 2017; González-Gonzalo et al, 2020; Taly et al, 2019; Quellec et al, 2017; Keel et al, 2019; Wang et al, 2017). Deep learning classifiers are often considered as "black boxes" due to the lack of explanatory factors in their decisions.…”

Section: Introduction (mentioning)

confidence: 99%

Deep Interpretable Classification and Weakly-Supervised Segmentation of Histology Images via Max-Min Uncertainty

Belharbi, Rony, Dolz et al. 2020

Preprint

Weakly supervised learning (WSL) has recently triggered substantial interest as it mitigates the lack of pixel-wise annotations, while enabling interpretable models. Given global image labels, WSL methods yield pixel-level predictions (segmentations). Despite their recent success, mostly with natural images, such methods could be seriously challenged when the foreground and background regions have similar visual cues, yielding high false-positive rates in segmentations, as is the case of challenging histology images. WSL training is commonly driven by standard classification losses, which implicitly maximize model confidence and find the discriminative regions linked to classification decisions. Therefore, they lack mechanisms for modeling explicitly non-discriminative regions and reducing false-positive rates. We propose new regularization terms, which enable the model to seek both non-discriminative and discriminative regions, while discouraging unbalanced segmentations. We introduce high uncertainty as a criterion to localize non-discriminative regions that do not affect classifier decision, and describe it with original Kullback-Leibler (KL) divergence losses evaluating the deviation of posterior predictions from the uniform distribution. Our KL terms encourage high uncertainty of the model when the latter takes the latent non-discriminative regions as input. Our loss integrates: (i) a cross-entropy seeking a foreground, where model confidence about class prediction is high; (ii) a KL regularizer seeking a background, where model uncertainty is high; and (iii) log-barrier terms discouraging unbalanced segmentations. Comprehensive experiments and ablation studies over the public GlaS colon cancer data show substantial improvements over state-of-the-art WSL methods, and confirm the effect of our new regularizers. Our code is publicly available.
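The KL term mentioned in this abstract, a divergence measuring how far posterior predictions deviate from the uniform distribution, can be illustrated with a minimal sketch. The abstract does not state the direction of the divergence or how the term is weighted, so the choice of KL(u || p) here is an assumption, and the function name `kl_uniform_to_posterior` is hypothetical rather than taken from the authors' code.

```python
import math


def kl_uniform_to_posterior(probs, eps=1e-12):
    """Return KL(u || p), where u is uniform over len(probs) classes.

    Assumption: the direction KL(u || p) is chosen for illustration only;
    the abstract does not specify it. The value is 0 when p is uniform
    (maximum uncertainty) and grows as p becomes more confident.
    """
    k = len(probs)
    u = 1.0 / k
    # Clamp each probability at eps so log() never receives zero.
    return sum(u * math.log(u / max(p, eps)) for p in probs)


# A maximally uncertain posterior incurs no penalty; a confident one does.
uncertain = kl_uniform_to_posterior([0.5, 0.5])
confident = kl_uniform_to_posterior([0.9, 0.1])
```

Minimizing such a term over regions treated as background would push the classifier toward high uncertainty there, which matches the role the abstract gives to its KL regularizer.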

“…RISE [24] averages random binary masks according to the model's output class probability for the masked inputs. This is extended in D-RISE [25] by a similarity metric allowing its application to detection models as well.…”

Section: Related Work (mentioning)

confidence: 99%
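The RISE idea summarized in the statement above, a score-weighted average of random binary masks, can be sketched as follows. This is a simplified illustration and not the published implementation: it samples independent per-pixel masks instead of upsampled low-resolution grids, and `rise_saliency`, `score_fn`, `n_masks`, and `keep_prob` are hypothetical names chosen for this example.

```python
import random


def rise_saliency(image, score_fn, n_masks=500, keep_prob=0.5, seed=0):
    """Estimate a saliency map as a score-weighted average of random masks.

    image:    2D list of pixel values.
    score_fn: black-box callable mapping a masked image to a scalar score
              (for a classifier, the probability of the class of interest).
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    saliency = [[0.0] * w for _ in range(h)]
    for _ in range(n_masks):
        # Sample a binary mask: 1 keeps a pixel, 0 occludes it.
        mask = [[1 if rng.random() < keep_prob else 0 for _ in range(w)]
                for _ in range(h)]
        masked = [[image[y][x] * mask[y][x] for x in range(w)]
                  for y in range(h)]
        score = score_fn(masked)
        # Each kept pixel is credited with the score obtained under this mask.
        for y in range(h):
            for x in range(w):
                saliency[y][x] += score * mask[y][x]
    # Normalize by the expected number of masks in which a pixel is kept.
    norm = n_masks * keep_prob
    return [[v / norm for v in row] for row in saliency]


# Toy black box whose score depends only on the centre pixel.
toy_image = [[1.0] * 3 for _ in range(3)]
toy_map = rise_saliency(toy_image, lambda img: img[1][1])
```

In this toy setting the centre pixel receives the highest saliency because it is the only one that affects the score. For a detector, the statement above indicates that D-RISE replaces the class probability with a similarity metric; in this sketch that would correspond to supplying a different `score_fn`.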

“…Inspired by perturbation approaches to generate saliency maps for image-based black-box models [24,25,56], we leverage the principle of analysis by occlusion. We propose OccAM: Occlusion-based Attribution Maps for 3D object detectors on LiDAR data.…”

Section: Introduction (mentioning)

confidence: 99%

OccAM's Laser: Occlusion-based Attribution Maps for 3D Object Detectors on LiDAR Data

Schinagl, Krispel, Possegger et al. 2022

Preprint

While 3D object detection in LiDAR point clouds is well-established in academia and industry, the explainability of these models is a largely unexplored field. In this paper, we propose a method to generate attribution maps for the detected objects in order to better understand the behavior of such models. These maps indicate the importance of each 3D point in predicting the specific objects. Our method works with black-box models: We do not require any prior knowledge of the architecture nor access to the model's internals, like parameters, activations or gradients. Our efficient perturbation-based approach empirically estimates the importance of each point by testing the model with randomly generated subsets of the input point cloud. Our sub-sampling strategy takes into account the special characteristics of LiDAR data, such as the depth-dependent point density. We show a detailed evaluation of the attribution maps and demonstrate that they are interpretable and highly informative. Furthermore, we compare the attribution maps of recent 3D object detection architectures to provide insights into their decision-making processes.

“…Thirdly, the bias present in pre-trained models may propagate into the target task leading to an inadvertently biased target model. The deep networks exhibit different types of biases due to factors such as background, color, racial (Gwilliam et al (2021)), gender (Tang et al (2021); Zhao et al (2017)), contextual (Singh et al (2020)), co-occurrence (Petsiuk et al (2021)), spatial noise, dataset (Tommasi et al (2017)) and object-size (Nguyen et al (2020)). For instance, Petsiuk et al (2021) show that the object detectors are vulnerable to learning the co-occurrence of an unrelated adversarial marker.…”

Section: Introduction (mentioning)

confidence: 99%

Co-Segmentation Inspired Attention Networks for Video-Based Person Re-Identification

Subramaniam, Nambiar, Mittal 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

111

Video based computer vision tasks can benefit from estimation of the salient regions and interactions between those regions. Traditionally, this has been done by identifying the object regions in the images by utilizing pre-trained models to perform object detection, object segmentation and/or object pose estimation. Though using pre-trained models seems to be a viable approach, it is infeasible in practice due to the need for exhaustive annotation of object categories, domain gap between datasets and bias present in pre-trained models. To overcome these downsides, we propose to utilize the common rationale that a sequence of video frames capture a set of common objects and interactions between them, thus a notion of co-segmentation between the video frame features may equip the model with the ability to automatically focus on salient regions and improve underlying task's performance in an end-to-end manner. In this regard, we propose a generic module called "Co-Segmentation Activation Module" (COSAM) that can be plugged-in to any CNN to promote the notion of co-segmentation based attention among a sequence of video frame features. We show the application of COSAM in three video based tasks namely: 1) Video-based person re-ID, 2) Video captioning, & 3) Video action classification and demonstrate that COSAM is able to capture salient regions in the video frames, thus leading to notable performance improvements along with interpretable attention maps.
