In this paper, we propose a novel self-supervised learning model for estimating continuous ego-motion from video. Our model learns to estimate camera motion by watching RGBD or RGB video streams and determining the translational and rotational velocities that correctly predict the appearance of future frames. Our approach differs from other recent work on self-supervised structure-from-motion in its use of a continuous motion formulation and its representation of rigid motion fields rather than direct prediction of camera parameters. To make estimation robust in dynamic environments with multiple moving objects, we introduce a simple two-component segmentation process that isolates the rigid background environment from dynamic scene elements. We demonstrate state-of-the-art accuracy of the self-trained model on several benchmark ego-motion datasets and highlight the ability of the model to provide superior rotational accuracy and handling of non-rigid scene motions.
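To make the continuous formulation concrete, the sketch below shows one plausible way to score a candidate twist (v, w): back-project pixels using depth, build the rigid motion field, project forward, and measure a photometric error against the next frame. This is an illustrative reading of the abstract, not the authors' code; all function names, the sign conventions, and the nearest-neighbour warp are assumptions.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel to a 3D point using its depth and the intrinsics K."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    return depth[..., None] * (pix @ np.linalg.inv(K).T)

def photometric_loss(frame_t, frame_t1, depth, v, w, K, dt=1.0):
    """Score a candidate twist (v, w) by how well the induced rigid motion
    field predicts the appearance of the future frame."""
    H, W = depth.shape
    X = backproject(depth, K)
    X_dot = v + np.cross(w, X)                   # rigid motion field of the static scene
    proj = (X + dt * X_dot) @ K.T                # project points forward in time
    u = np.clip(np.round(proj[..., 0] / proj[..., 2]).astype(int), 0, W - 1)
    vpix = np.clip(np.round(proj[..., 1] / proj[..., 2]).astype(int), 0, H - 1)
    warped = frame_t1[vpix, u]                   # sample the future frame at projected pixels
    return np.mean(np.abs(warped - frame_t))     # L1 photometric error
```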
We present an unconventional image super-resolution algorithm targeting focal stack images. Contrary to previous works, which align multiple images with sub-pixel accuracy for image super-resolution, we analyze the correlation among the differently focused narrow depth-of-field images in a focal stack to infer high-resolution details. To accurately model the defocus kernels at different depths, we use cubic interpolation to parameterize the projections of the defocus kernels, and apply the Radon transform to accurately reconstruct the defocus kernels at arbitrary depths. For image super-resolution, we use a multi-image deconvolution method with l1-norm regularization to suppress noise and ringing artifacts. We also extend the depth of field of our inputs to produce an all-in-focus super-resolution image. The effectiveness of our algorithm is demonstrated with quantitative analysis on synthetic examples and qualitative analysis on real-world examples.
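A minimal sketch of the multi-image deconvolution objective implied by the abstract is given below: one latent high-resolution image should explain every focal-stack frame after blurring with its depth-dependent defocus kernel and downsampling, with an l1 penalty to suppress noise and ringing. Whether the l1 term acts on image gradients (as assumed here), the downsampling model, and all names are assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def objective(x, ys, kernels, scale, lam):
    """Sum of per-frame data terms over the focal stack plus an l1 gradient prior.

    x       : candidate high-resolution image (2D array)
    ys      : list of observed narrow depth-of-field frames
    kernels : matching list of defocus kernels (one per frame)
    scale   : integer downsampling factor from x to each y
    lam     : regularization weight
    """
    data = 0.0
    for y, k in zip(ys, kernels):
        pred = fftconvolve(x, k, mode="same")[::scale, ::scale]  # blur, then decimate
        data += np.sum((pred - y) ** 2)
    grad_l1 = np.sum(np.abs(np.diff(x, axis=0))) + np.sum(np.abs(np.diff(x, axis=1)))
    return data + lam * grad_l1
```

Minimizing this objective (e.g., with a proximal or iteratively reweighted solver) yields the super-resolved estimate; the choice of solver is left open here.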
Contextual information can have a substantial impact on the performance of visual tasks such as semantic segmentation, object detection, and geometric estimation. Data stored in Geographic Information Systems (GIS) offers a rich source of contextual information that has been largely untapped by computer vision. We propose to leverage such information for scene understanding by combining GIS resources with large sets of unorganized photographs using Structure from Motion (SfM) techniques. We present a pipeline to quickly generate strong 3D geometric priors from 2D GIS data using SfM models aligned with minimal user input. Given an image resectioned against this model, we generate robust predictions of depth, surface normals, and semantic labels. Despite the lack of detail in the model, we show that the predicted geometry is substantially more accurate than that of other single-image depth estimation methods. We then demonstrate the utility of these contextual constraints for re-scoring pedestrian detections, and use these GIS contextual features alongside object detection score maps to improve a CRF-based semantic segmentation framework, boosting accuracy over baseline models.
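As a small illustration of how rendered model geometry can re-score detections, the sketch below uses the depth at a detection's footpoint to predict how tall a typical pedestrian should appear and down-weights boxes whose height disagrees. This is not the paper's pipeline; the 1.7 m height prior, the Gaussian re-weighting, and all names are illustrative assumptions.

```python
import numpy as np

def rescore_detections(dets, depth_map, fy, person_height=1.7, sigma=0.3):
    """dets: list of (x1, y1, x2, y2, score) boxes in pixels.
    depth_map: per-pixel depth rendered from the aligned GIS/SfM model.
    fy: camera focal length in pixels (vertical)."""
    H, W = depth_map.shape
    rescored = []
    for x1, y1, x2, y2, score in dets:
        col = int(np.clip((x1 + x2) / 2, 0, W - 1))
        row = int(np.clip(y2, 0, H - 1))
        foot_depth = depth_map[row, col]
        if not np.isfinite(foot_depth) or foot_depth <= 0:
            rescored.append((x1, y1, x2, y2, score))      # no geometry: keep score as-is
            continue
        expected_h = fy * person_height / foot_depth      # pinhole projection of ~1.7 m
        ratio = (y2 - y1) / expected_h
        geom = np.exp(-0.5 * ((ratio - 1.0) / sigma) ** 2)
        rescored.append((x1, y1, x2, y2, score * geom))
    return rescored
```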