Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems. However, these deep models are often perceived as "black box" methods, given the limited understanding of their internal functioning. There has been significant recent interest in developing explainable deep learning models, and this paper is an effort in that direction. Building on a recently proposed method called Grad-CAM, we propose a generalized method called Grad-CAM++ that provides better visual explanations of CNN model predictions than the state of the art, both in terms of object localization and in explaining occurrences of multiple object instances in a single image. We provide a mathematical derivation for the proposed method, which uses a weighted combination of the positive partial derivatives of the last convolutional layer's feature maps with respect to a specific class score as weights to generate a visual explanation for the corresponding class label. Our extensive subjective and objective experiments on standard datasets show that Grad-CAM++ provides promising human-interpretable visual explanations for a given CNN architecture across multiple tasks, including classification, image caption generation and 3D action recognition, as well as in new settings such as knowledge distillation.
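As a rough illustration of the weighting scheme this abstract describes, the following NumPy sketch computes a Grad-CAM++-style saliency map from a precomputed feature-map tensor `acts` and its gradients `grads` with respect to the class score. The closed-form `alpha` coefficients assume an exponential final activation, as in the paper's derivation; all variable names here are our own, not the paper's.

```python
import numpy as np

def grad_cam_pp(acts, grads, eps=1e-8):
    """Sketch of a Grad-CAM++ saliency map.

    acts  : (K, H, W) activations of the last convolutional layer.
    grads : (K, H, W) gradients of the class score w.r.t. acts.
    """
    # With an exponential class score, higher-order derivatives reduce
    # to powers of the first-order gradient.
    g2, g3 = grads ** 2, grads ** 3
    # Per-location weighting coefficients alpha_{ij}^{kc}.
    denom = 2.0 * g2 + acts.sum(axis=(1, 2), keepdims=True) * g3
    alpha = g2 / np.where(denom != 0.0, denom, eps)
    # Channel weights: alpha-weighted combination of positive gradients.
    w = (alpha * np.maximum(grads, 0.0)).sum(axis=(1, 2))
    # ReLU of the weighted sum of feature maps gives the saliency map.
    cam = np.maximum((w[:, None, None] * acts).sum(axis=0), 0.0)
    return cam / (cam.max() + eps)  # normalise to [0, 1]
```

Restricting the combination to positive gradients is what lets the map highlight every region that contributes positively to the class score, which is why multiple object instances can surface in one explanation.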
Humans have a natural instinct to identify unknown object instances in their environment. Intrinsic curiosity about these unknown instances aids in learning about them once the corresponding knowledge eventually becomes available. This motivates us to propose a novel computer vision problem called 'Open World Object Detection', where a model is tasked to: 1) identify objects that have not been introduced to it as 'unknown', without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, as the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy-based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterising unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial, research direction.
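The abstract does not spell out the scoring rule behind "energy based unknown identification", but energy-based out-of-distribution detection is commonly implemented as a free-energy score over the classifier logits. The sketch below shows that generic recipe with a hypothetical threshold `tau`; it is an illustration of the technique's flavour, not ORE's exact procedure.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Free energy of a logit vector: -T * log(sum(exp(logits / T))).
    Lower energy typically indicates a known (in-distribution) sample."""
    z = logits / T
    m = z.max()  # stabilise the log-sum-exp
    return -T * (m + np.log(np.exp(z - m).sum()))

def flag_unknown(logits, tau):
    """Label a detection 'unknown' when its energy exceeds a
    (hypothetical) threshold tau, fitted on held-out known classes."""
    return energy_score(logits) > tau
```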
The estimation of head pose angle from face images is an integral component of face recognition systems, human-computer interfaces and other human-centered computing applications. To determine the head pose, face images with varying pose angles can be considered to lie on a smooth low-dimensional manifold in a high-dimensional feature space. While manifold learning techniques capture the geometric relationships between data points in the high-dimensional image feature space, the pose label information of the training samples is neglected when these embeddings are computed. In this paper, we propose a novel supervised approach to manifold-based non-linear dimensionality reduction for head pose estimation. The Biased Manifold Embedding (BME) framework is built on the idea of using the pose angle information of the face images to compute a biased neighborhood of each point in the feature space, before determining the low-dimensional embedding. The proposed BME approach is formulated as an extensible framework and validated with the Isomap, Locally Linear Embedding (LLE) and Laplacian Eigenmaps techniques. A Generalized Regression Neural Network (GRNN) is used to learn the non-linear mapping, and linear multivariate regression is finally applied in the low-dimensional space to obtain the pose angle. We tested this approach on face images of 24 individuals with pose angles varying from −90° to +90° at a granularity of 2°. The results showed a substantial reduction in pose angle estimation error, and robustness to variations in feature spaces, embedding dimensionality and other parameters.
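One simple way to realise the biased neighborhood described above is to inflate pairwise feature distances between training images whose pose labels disagree before running a standard manifold learner. The sketch below takes this route with Isomap via scikit-learn's precomputed-distance mode; the specific biasing function (and the `beta` strength parameter) are our own illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import Isomap

def biased_embedding(X, poses, n_components=8, n_neighbors=12, beta=1.0):
    """Sketch of a pose-biased Isomap embedding.

    X     : (n, d) image feature vectors.
    poses : (n,) pose angles in degrees.
    beta  : strength of the label bias (illustrative assumption).
    """
    D = squareform(pdist(X))  # feature-space distances
    pose_gap = np.abs(poses[:, None] - poses[None, :]).astype(float)
    pose_gap = pose_gap / (pose_gap.max() + 1e-8)  # normalise to [0, 1]
    # Inflate distances between points with dissimilar pose labels, so
    # each point's neighborhood is biased toward similar poses.
    D_biased = D * (1.0 + beta * pose_gap)
    iso = Isomap(n_neighbors=n_neighbors,
                 n_components=n_components, metric="precomputed")
    return iso.fit_transform(D_biased)
```

After the embedding, a regressor (a GRNN in the paper) would map unseen images into this space, from which the pose angle is read off by linear regression.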
Can we improve detection in the thermal domain by borrowing features from rich domains such as visual RGB? In this paper, we propose a 'pseudo-multimodal' object detector, trained on natural-image-domain data, to improve the performance of object detection in thermal images. We assume access to a large-scale dataset in the visual RGB domain and a relatively smaller dataset (in terms of instances) in the thermal domain, as is common today. We propose using well-known image-to-image translation frameworks to generate a pseudo-RGB equivalent of a given thermal image, and then use a multimodal architecture for object detection in the thermal image. We show that our framework outperforms existing benchmarks without the explicit need for paired training examples from the two domains. We also show that our framework can learn with less data from the thermal domain.
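Since the abstract names neither the detector nor the translation network beyond "well-known frameworks", the PyTorch sketch below only illustrates the pseudo-multimodal idea: translate the thermal image to a pseudo-RGB image with a frozen, pretrained generator (e.g., a CycleGAN-style model, passed in as a placeholder), then fuse features from both branches for a downstream detection head. The module, layer sizes and names are hypothetical.

```python
import torch
import torch.nn as nn

class PseudoMultimodalBackbone(nn.Module):
    """Illustrative two-branch backbone: one branch sees the thermal
    image, the other a pseudo-RGB translation of it; the fused features
    would feed a detection head (not shown)."""

    def __init__(self, translator: nn.Module):
        super().__init__()
        self.translator = translator  # pretrained thermal-to-RGB generator (assumed)
        self.thermal_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, thermal):
        with torch.no_grad():  # the translator stays frozen
            pseudo_rgb = self.translator(thermal)
        fused = torch.cat([self.thermal_branch(thermal),
                           self.rgb_branch(pseudo_rgb)], dim=1)
        return fused  # (B, 128, H/2, W/2) fused feature map
```

Because the translation is unpaired, no aligned RGB/thermal training examples are needed, which matches the framework's stated advantage.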