This work addresses the task of camera localization in a known 3D scene given a single input RGB image. State-of-the-art approaches accomplish this in two steps: first, regressing the 3D scene coordinate of every pixel in the image and, second, using these coordinates to estimate the final 6D camera pose via RANSAC. Random Forests (RFs) are typically used for the first step. Neural Networks (NNs), on the other hand, dominate many dense regression tasks but are not test-time efficient. We ask: which of the two is best for camera localization? To address this question, we make two methodological contributions: (1) a test-time efficient NN architecture, which we term a ForestNet, that is derived and initialized from an RF, and (2) a new fully differentiable robust averaging technique for regression ensembles that can be trained end-to-end with an NN. Our experimental findings show that for scene coordinate regression, traditional NN architectures are superior to test-time efficient RFs and ForestNets; however, this does not translate to final 6D camera pose accuracy, where RFs and ForestNets perform slightly better. In summary, our best method, a ForestNet with a robust average, which has an equivalent fast and lightweight RF, improves over the state-of-the-art for camera localization on the 7-Scenes dataset [1]. While this work focuses on scene coordinate regression for camera localization, our innovations may also be applied to other continuous regression tasks.
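The abstract does not spell out the robust averaging operation. As an illustration only, one differentiable way to robustly average an ensemble of per-pixel 3D scene-coordinate predictions is a fixed number of Weiszfeld-style reweighting steps toward the geometric median; the weighting scheme and iteration count below are assumptions, not the paper's specification.

```python
import numpy as np

def robust_average(preds, n_iters=5, eps=1e-6):
    """Differentiable robust average of an ensemble of 3D predictions.

    preds: array of shape (T, 3), one 3D scene-coordinate prediction per
           ensemble member (e.g. per tree or per ForestNet branch).
    A fixed number of Weiszfeld-style reweighting steps pulls the estimate
    toward the geometric median; every step is smooth, so the operation can
    be back-propagated through in an autodiff framework.
    """
    centre = preds.mean(axis=0)                      # start from the plain mean
    for _ in range(n_iters):
        dists = np.linalg.norm(preds - centre, axis=1) + eps
        weights = 1.0 / dists                        # down-weight outliers
        weights /= weights.sum()
        centre = (weights[:, None] * preds).sum(axis=0)
    return centre

# Example: four ensemble predictions, one of which is a gross outlier.
ensemble = np.array([[1.00, 2.00, 3.00],
                     [1.02, 1.98, 3.01],
                     [0.99, 2.01, 2.99],
                     [9.00, 9.00, 9.00]])            # outlier
print(robust_average(ensemble))   # close to (1, 2, 3), unlike the plain mean
```

The point of such a scheme is that, unlike a hard median, every intermediate quantity is a smooth function of the inputs, which is what allows the averaging step to be trained end-to-end with the network.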
We consider the task of learning a classifier for semantic segmentation using weak supervision in the form of image labels that specify which object classes are present in the image. Our method uses deep convolutional neural networks (CNNs) and adopts an Expectation-Maximization (EM) based approach. We focus on three aspects of EM: (i) initialization; (ii) latent posterior estimation (E-step); and (iii) the parameter update (M-step). We show that saliency and attention maps of simple images, our bottom-up and top-down cues respectively, provide very good cues for learning an initialization for the EM-based algorithm. Intuitively, we show that before trying to learn to segment complex images, it is much easier and highly effective to first learn to segment a set of simple images and then move on to the complex ones. Next, to update the parameters, we propose minimizing a combination of the standard softmax loss and the KL divergence between the true latent posterior and the likelihood given by the CNN. We argue that this combination is more robust to wrong predictions made by the expectation step of the EM method, and we support this argument with empirical and visual results. Extensive experiments and discussions show that: (i) our method is simple and intuitive; (ii) it requires only image-level labels; and (iii) it consistently outperforms other weakly supervised state-of-the-art methods by a large margin on the PASCAL VOC 2012 dataset.
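As a rough sketch of the M-step objective described above (the relative weighting of the two terms, alpha, is an assumed hyper-parameter and not taken from the paper), the combined loss per pixel can be written as cross-entropy against the pseudo-labels implied by the latent posterior plus the KL divergence between that posterior and the CNN likelihood:

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def m_step_loss(logits, posterior, alpha=1.0):
    """Illustrative M-step objective for weakly supervised segmentation.

    logits:    (N, C) raw CNN scores for N pixels over C classes.
    posterior: (N, C) latent posterior estimated in the E-step.
    alpha:     relative weight of the KL term (assumed, for illustration).
    """
    probs = softmax(logits)
    hard = posterior.argmax(axis=1)                           # pseudo-labels
    ce = -np.log(probs[np.arange(len(hard)), hard] + 1e-12).mean()
    kl = (posterior * (np.log(posterior + 1e-12)
                       - np.log(probs + 1e-12))).sum(axis=1).mean()
    return ce + alpha * kl
```

The KL term uses the full (soft) posterior rather than only its argmax, which is why the combination is less sensitive to individual wrong E-step predictions than the softmax loss alone.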
Sighted people predominantly use vision to navigate spaces, and sight loss has negative consequences for independent navigation and mobility. The recent proliferation of devices that can extract 3D spatial information from visual scenes opens up the possibility of using such mobility-relevant information to assist blind and visually impaired people by presenting it through modalities other than vision. In this work, we present two new methods for encoding visual scenes using spatial audio: simulated echolocation and distance-dependent hum volume modulation. We implemented both methods in a virtual reality (VR) environment and tested them using a 3D motion-tracking device, which allowed participants to physically walk through virtual mobility scenarios and generate data on real locomotion behaviour. Blindfolded sighted participants completed two tasks: maze navigation and obstacle avoidance. Results were measured against a visual baseline in which participants performed the same two tasks without blindfolds. Task completion time, speed and number of collisions were used as indicators of successful navigation, with additional metrics exploring detailed dynamics of performance. In both tasks, participants were able to navigate using only audio information after minimal instruction. While participants were 65% slower using audio than in the visual baseline, they reduced their audio navigation time by an average of 21% over just six trials. Hum volume modulation proved over 20% faster than simulated echolocation in both mobility scenarios, and participants also showed the greatest improvement with this sonification method. Nevertheless, we speculate that simulated echolocation remains worth exploring, as it provides more spatial detail and could therefore be more useful in complex environments. The fact that participants were intuitively able to navigate space successfully with two new visual-to-audio mappings for conveying spatial information motivates the further exploration of these and other mappings, with the goal of assisting blind and visually impaired individuals with independent mobility.
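To make the hum-volume idea concrete, a minimal sketch of a distance-to-gain mapping is shown below; the linear mapping and the 0.3-5 m range are illustrative assumptions, not the parameters used in the study.

```python
import numpy as np

def hum_gain(distance_m, d_min=0.3, d_max=5.0):
    """Map distance to the nearest obstacle to a hum amplitude in [0, 1].

    Closer obstacles produce a louder hum; at or beyond d_max the hum is
    silent. The linear profile and range are assumed for illustration.
    """
    d = np.clip(distance_m, d_min, d_max)
    return 1.0 - (d - d_min) / (d_max - d_min)

print(hum_gain(0.5), hum_gain(2.0), hum_gain(6.0))  # loud, moderate, silent
```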