This paper considers the problem of single image depth estimation. The use of convolutional neural networks (CNNs) has recently brought about significant advancements in research on this problem. However, most existing methods suffer from a loss of spatial resolution in the estimated depth maps; a typical symptom is distorted and blurry reconstruction of object boundaries. In this paper, aiming at more accurate estimation of depth maps with higher spatial resolution, we propose two improvements to existing approaches. One concerns the strategy for fusing features extracted at different scales, for which we propose an improved network architecture consisting of four modules: an encoder, a decoder, a multi-scale feature fusion module, and a refinement module. The other concerns the loss functions used to measure inference errors during training. We show that three loss terms, which measure errors in depth, gradients, and surface normals, respectively, contribute to improved accuracy in a complementary fashion. Experimental results show that these two improvements enable us to attain higher accuracy than the current state of the art, as reflected in finer-resolution reconstruction, for example of small objects and object boundaries.
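As an illustration of how such a three-term loss might be combined in training, the following is a minimal sketch; the specific error measures, the log-depth space, and the weights are our assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def depth_loss(pred, gt, w_depth=1.0, w_grad=1.0, w_normal=1.0):
    """Combined loss with depth, gradient, and surface-normal terms.

    pred, gt: (B, 1, H, W) strictly positive depth maps. The exact
    formulations below are illustrative assumptions.
    """
    # Depth term: mean absolute error in log-depth space.
    l_depth = torch.mean(torch.abs(torch.log(pred) - torch.log(gt)))

    # Gradient term: error in horizontal and vertical depth gradients.
    def grads(d):
        return d[..., :, 1:] - d[..., :, :-1], d[..., 1:, :] - d[..., :-1, :]

    pgx, pgy = grads(pred)
    ggx, ggy = grads(gt)
    l_grad = torch.mean(torch.abs(pgx - ggx)) + torch.mean(torch.abs(pgy - ggy))

    # Normal term: 1 - cosine similarity between surface normals derived
    # from the depth gradients, n ~ (-dx, -dy, 1).
    def normals(gx, gy):
        gx, gy = gx[..., :-1, :], gy[..., :, :-1]  # crop to a common grid
        n = torch.stack((-gx, -gy, torch.ones_like(gx)), dim=-1)
        return F.normalize(n, dim=-1)

    cos = torch.sum(normals(pgx, pgy) * normals(ggx, ggy), dim=-1)
    l_normal = torch.mean(1.0 - cos)

    return w_depth * l_depth + w_grad * l_grad + w_normal * l_normal
```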
This paper proposes a method for detecting changes in a scene from a pair of vehicular, omnidirectional images of the scene. The top images of Figure 1 show an example of such an image pair taken at different times. Apparently, there are temporal differences in illumination and photographing conditions. Moreover, visual differences due to differing camera viewpoints are unavoidable, even though the images were captured from a vehicle running on the same street and were matched using GPS data; this is due to differences in vehicle paths and shutter timing. The scene changes targeted here include 3D changes (e.g., the vanishing/emergence of buildings, cars, etc.) as well as 2D changes (e.g., changes of textures on building walls). To precisely detect these changes from such an image pair, it is necessary to overcome these unwanted visual differences.

We tackle the change detection problem in the 2D domain; that is, we consider detecting changes based on the direct comparison of a pair of images. The major issue is then how to deal with the unwanted visual differences (i.e., viewpoint differences etc.). To cope with this, we propose to use features extracted by convolutional neural networks (CNNs). To be specific, we use a CNN fully trained on a large-scale object recognition task [4] in a transfer learning setting. It has been reported in the literature that the activations of the upper layers of a CNN trained for a specific task can be reused for other visual classification tasks. Several recent studies imply that the upper layers of CNNs represent and encode highly abstract information about the input image [1, 2, 5]. We conjecture that highly abstract (or object-level) changes can be detected using the upper layers, whereas low-level visual changes (e.g., edges, textures, etc.) will be detected using the lower layers. We show that this conjecture holds through several experimental results.

The proposed method consists of three components: i) extraction of grid features, ii) superpixel segmentation, and iii) estimation of sky and ground areas by Geometric Context. These are described below.

(i) Extraction of grid features. We denote the two input images by $I_t$ and $I_{t'}$, where $t$ and $t'$ are the times at which they were captured. First, $I_t$ and $I_{t'}$ are divided into grid cells $g\,(=1,\ldots,N_g)$. A feature is extracted from each grid cell $g$, yielding $x^t_g$ and $x^{t'}_g$. The changes that we want to detect are object-level changes (e.g., the emergence/vanishing of buildings and cars), not low-level appearance changes due to differences in viewpoint, illumination, or photographing conditions. To distinguish the two, the proposed method uses the activation of an upper layer of a deep CNN for the grid features $x^t_g$ and $x^{t'}_g$. To be specific, we use a pooling layer of the CNN. Each feature (e.g., $x^t_g$) is the vector of activations of the units at the same spatial location across all the maps of the pooling layer; thus $x^t_g$ has as many elements as there are maps in the pooling layer. Next, these features are normalized so that $\|x^t_g\| = 1$, and then their dissimilarity is c...
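A minimal sketch of step (i) follows; the choice of VGG-16 and of its last pooling layer are our assumptions for illustration, and here each spatial location of that layer plays the role of one grid cell, a simplification of the gridding described above.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Pretrained CNN; .features ends with the last pooling layer of VGG-16.
cnn = models.vgg16(weights="IMAGENET1K_V1").features.eval()

def grid_features(image):
    # image: (3, H, W) tensor, already preprocessed for the CNN.
    with torch.no_grad():
        fmap = cnn(image.unsqueeze(0))[0]   # (C, h, w) pooling-layer maps
    # x_g is the C-vector of activations at one location across all maps,
    # so each feature has as many elements as there are maps.
    x = fmap.flatten(1).T                   # (N_g, C) with N_g = h * w
    return F.normalize(x, dim=1)            # enforce ||x_g|| = 1

def dissimilarity(img_t, img_t2):
    xt, xt2 = grid_features(img_t), grid_features(img_t2)
    # Cosine dissimilarity per grid cell; larger values suggest a change.
    return 1.0 - (xt * xt2).sum(dim=1)      # (N_g,)
```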
A key to visual question answering (VQA) lies in how to fuse the visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities contributes to boosting the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation, demonstrating how the proposed attention mechanism generates reasonable attention maps on images and questions, leading to correct answer prediction.
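One dense, symmetric co-attention step of the kind described might be sketched as follows; the bilinear affinity and the residual fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseCoAttention(nn.Module):
    """One symmetric co-attention step: every question word attends to
    all image regions and vice versa. Stacking such layers yields the
    multi-step hierarchy mentioned above."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)

    def forward(self, v, q):
        # v: (B, R, dim) image-region features; q: (B, T, dim) word features.
        affinity = v @ self.W @ q.transpose(1, 2)   # (B, R, T)
        # Each image region attends over all question words...
        v_att = F.softmax(affinity, dim=2) @ q      # (B, R, dim)
        # ...and each question word attends over all image regions.
        q_att = F.softmax(affinity, dim=1).transpose(1, 2) @ v  # (B, T, dim)
        # Residual fusion of attended context into each modality.
        return v + v_att, q + q_att
```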
This paper considers the problem of factorizing a matrix with missing components into a product of two smaller matrices, also known as principal component analysis with missing data (PCAMD). The Wiberg algorithm is a numerical algorithm developed for this problem in the applied mathematics community. We argue that the algorithm has not been correctly understood in the computer vision community. Although many studies in our community refer to the Wiberg study, as far as we know there is no literature in which the performance of the Wiberg algorithm is investigated or the details of the algorithm are presented. In this paper, we present a derivation of the algorithm, along with an implementation issue that needs to be carefully considered, and then examine its performance. The experimental results demonstrate that the Wiberg algorithm performs considerably well, contradicting the conventional view in our community that minimization-based algorithms tend to fail to converge to a global minimum relatively frequently. Even when started from random initial values, the Wiberg algorithm converges to a correct solution in most cases, even when the matrix has many missing components and the data are contaminated with very strong noise. Our conclusion is that the Wiberg algorithm can also be used as a standard algorithm for such problems in computer vision.
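To make the setting concrete, the structural idea at the heart of the Wiberg algorithm is that one factor can be eliminated in closed form, reducing the objective to a function of the other factor alone, to which Gauss-Newton is then applied. Below is a minimal numpy sketch of this elimination step in our own notation; it is not the full algorithm.

```python
import numpy as np

# PCAMD objective: minimize ||W * (Y - U V^T)||_F^2 over U (m x r) and
# V (n x r), where W is the 0/1 mask of observed entries of Y.

def optimal_V(Y, W, U):
    """Closed-form optimal V for fixed U: each column of Y gives an
    independent least-squares problem over its observed rows.
    Assumes each column has at least r observed entries."""
    n, r = Y.shape[1], U.shape[1]
    V = np.zeros((n, r))
    for j in range(n):
        obs = W[:, j] > 0                  # rows observed in column j
        A, b = U[obs, :], Y[obs, j]
        V[j], *_ = np.linalg.lstsq(A, b, rcond=None)
    return V

def reduced_objective(Y, W, U):
    """f(U) = 0.5 ||W * (Y - U V(U)^T)||_F^2; the Wiberg algorithm
    minimizes this function of U alone via Gauss-Newton."""
    V = optimal_V(Y, W, U)
    R = W * (Y - U @ V.T)
    return 0.5 * np.sum(R ** 2)
```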