This paper studies the monocular visual odometry (VO) problem. Most existing VO algorithms are developed under a standard pipeline including feature extraction, feature matching, motion estimation, and local optimisation. Although some of them have demonstrated superior performance, they usually need to be carefully designed and specifically fine-tuned to work well in different environments. Some prior knowledge is also required to recover an absolute scale for monocular VO. This paper presents a novel end-to-end framework for monocular VO using deep Recurrent Convolutional Neural Networks (RCNNs). Since it is trained and deployed in an end-to-end manner, it infers poses directly from a sequence of raw RGB images (a video) without adopting any module from the conventional VO pipeline. Based on the RCNNs, it not only automatically learns an effective feature representation for the VO problem through Convolutional Neural Networks, but also implicitly models sequential dynamics and relations using deep Recurrent Neural Networks. Extensive experiments on the KITTI VO dataset show performance competitive with state-of-the-art methods, verifying that the end-to-end deep learning technique can be a viable complement to traditional VO systems.
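To make the recurrent convolutional design concrete, the following is a minimal PyTorch sketch of such a network: a CNN encodes pairs of consecutive frames stacked on the channel axis, an LSTM models the sequential dynamics, and a linear head regresses a 6-DoF relative pose per step. The layer sizes, the pooling, and the frame-pairing scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a recurrent convolutional VO network (illustrative layer
# sizes, not the paper's exact architecture).
import torch
import torch.nn as nn

class RecurrentVO(nn.Module):
    def __init__(self, hidden=1000):
        super().__init__()
        # CNN over two RGB frames stacked on the channel axis (6 channels).
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.rnn = nn.LSTM(256 * 4 * 4, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 6)  # translation (3) + rotation (3)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W); pair frame t with frame t+1.
        pairs = torch.cat([frames[:, :-1], frames[:, 1:]], dim=2)
        b, t = pairs.shape[:2]
        feats = self.cnn(pairs.flatten(0, 1)).flatten(1).view(b, t, -1)
        out, _ = self.rnn(feats)          # sequential dynamics via LSTM
        return self.head(out)             # relative 6-DoF pose per step

poses = RecurrentVO()(torch.randn(1, 5, 3, 128, 384))  # -> (1, 4, 6)
```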
Machine learning techniques, namely convolutional neural networks (CNNs) and regression forests, have recently shown great promise in performing 6-DoF localization of monocular images. However, in most cases image sequences, rather than only single images, are readily available, and none of the proposed learning-based approaches exploit the valuable constraint of temporal smoothness, often leading to situations where the per-frame error is larger than the camera motion. In this paper we propose a recurrent model for performing 6-DoF localization of video clips. We find that, even when considering only short sequences (20 frames), the pose estimates are smoothed and the localization error can be drastically reduced. Finally, we consider means of obtaining probabilistic pose estimates from our model. We evaluate our method on openly available real-world autonomous driving and indoor localization datasets.
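One common way to obtain probabilistic pose estimates from a learned localizer (the paper's exact mechanism may differ) is Monte Carlo dropout: keep dropout active at test time and run repeated stochastic forward passes to estimate a per-frame mean and variance of the 6-DoF pose. The sketch below assumes `model` is any recurrent localizer containing dropout layers and `clip` is a batched video tensor.

```python
# MC-dropout sketch for probabilistic pose estimates (an assumed
# technique, hedged: not necessarily the method used in the paper).
import torch

def mc_dropout_poses(model, clip, n_samples=20):
    model.train()  # keep dropout layers stochastic at inference time
    with torch.no_grad():
        samples = torch.stack([model(clip) for _ in range(n_samples)])
    # Mean acts as the pose estimate, variance as its uncertainty.
    return samples.mean(dim=0), samples.var(dim=0)
```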
This paper studies visual odometry (VO) from the perspective of deep learning. After tremendous efforts in the robotics and computer vision communities over the past few decades, state-of-the-art VO algorithms have demonstrated incredible performance. However, since the VO problem is typically formulated as a pure geometric problem, one of the key features still missing from current VO systems is the capability to automatically gain knowledge and improve performance through learning. In this paper, we investigate whether deep neural networks can be effective and beneficial to the VO problem. An end-to-end, sequence-to-sequence probabilistic visual odometry (ESP-VO) framework is proposed for monocular VO based on deep recurrent convolutional neural networks. It is trained and deployed in an end-to-end manner, that is, directly inferring poses and uncertainties from a sequence of raw images (video) without adopting any modules from the conventional VO pipeline. It can not only automatically learn an effective feature representation encapsulating geometric information through convolutional neural networks, but also implicitly model sequential dynamics and relations for VO using deep recurrent neural networks. Uncertainty is also derived along with the VO estimation without introducing much extra computation. Extensive experiments on several datasets representing driving, flying and walking scenarios show competitive performance of the proposed ESP-VO against state-of-the-art methods, demonstrating the promising potential of deep learning for VO and verifying that it can be a viable complement to current VO systems.
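A standard way to derive uncertainty alongside the pose with little extra computation (a sketch of the general formulation; ESP-VO's exact loss may differ) is heteroscedastic regression: the network outputs a pose mean and a per-dimension log-variance, trained with a Gaussian negative log-likelihood so the variance is learned implicitly from the data.

```python
# Gaussian NLL sketch for joint pose/uncertainty learning (an assumed
# formulation, not confirmed to be ESP-VO's exact objective).
import torch

def nll_pose_loss(pred_pose, pred_log_var, target_pose):
    # All tensors: (batch, time, 6). The log-variance term penalizes
    # overconfidence; the exp(-log_var) term down-weights noisy targets.
    sq_err = (pred_pose - target_pose) ** 2
    return (sq_err * torch.exp(-pred_log_var) + pred_log_var).mean()
```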
Recent advancements in deep learning have opened new opportunities for learning a high-quality 3D model from a single 2D image, given sufficient training on large-scale datasets. However, the significant imbalance between the available amounts of images and 3D models, and the limited availability of labeled 2D image data (i.e., manually annotated pairs between images and their corresponding 3D models), severely impact the training of most supervised deep learning methods in practice. In this paper, driven by a novel design of adversarial networks, we develop an unsupervised learning paradigm to reconstruct 3D models from a single 2D image, which is free of manually annotated pairs between input images and their associated 3D models. In particular, the paradigm begins by training an adaptation network, an autoencoder with an adversarial loss, which embeds the unpaired 2D synthesized-image domain and the real-world image domain into a shared latent vector space. Then, we jointly train a 3D deconvolutional network to transform the latent vector space into the 3D object space together with the embedding process. Our experiments verify our network's robust and superior performance in handling 3D volumetric object generation from a single 2D image.

Existing works on 3D object reconstruction from 2D image(s) can be broadly categorized into two groups: traditional methods without learning, and deep learning based methods.

3D reconstruction without learning. The majority of traditional reconstruction methods based on SfM or SLAM [1, 2] require a dense set of views, and most of them rely on the hypothesis that features can be matched across views. 2D-to-3D reconstruction models such as multi-view stereo [9, 10], space carving [11], and multiple-moving-object and large-scale structure from motion [3][4][5] have all demonstrated good performance on the 2D-to-3D reconstruction problem. However, these methods require highly calibrated cameras and segmentation of objects from their background, which makes them less applicable in practice.

Deep neural networks in 3D visual computing. Nowadays, by generating 3D volumetric data [12], prominent deep learning models such as deep 2D convolutional neural networks can be naturally extended to learn 3D objects. Deep learning models have proven to have strong capabilities in learning a latent representative vector space of 3D objects [12]. Multi-View CNN, Conv-DAE, VoxNet, GIFT, T-L embedding, 3D-GAN, and others [13][14][15][16][17][18] have uncovered great potential for solving retrieval, classification, 3D reconstruction, and related problems. In contrast to the vast amount of research and accomplishments in the field of 3D object classification and retrieval, there is less research and there are far fewer accomplished results on 3D object reconstruction. Recently, researchers began to utilize 3D deconvolutional neural networks to generate 3D volumetric objects from 2D images; for instance, 3D-GAN [18] and T-L embedding [17] strive to learn a latent vector space representation of 2D images, and then transform it to gene...
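The described paradigm decomposes into three trainable pieces, sketched below in PyTorch: an encoder that maps both synthetic and real 2D images to a shared latent space, a discriminator on latent codes that supplies the adversarial domain-alignment signal, and a 3D deconvolutional decoder that maps a latent code to a voxel grid. All layer sizes, the 128-d latent dimension, and the 32^3 output resolution are illustrative assumptions rather than the paper's reported settings.

```python
# Structural sketch of the adversarial adaptation + 3D decoding paradigm
# (illustrative sizes; not the paper's exact networks or losses).
import torch
import torch.nn as nn

latent = 128  # assumed latent dimensionality

encoder = nn.Sequential(  # 2D image -> shared latent vector
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, latent),
)
domain_disc = nn.Sequential(  # latent vector -> P(synthetic vs. real)
    nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid(),
)
decoder3d = nn.Sequential(  # latent vector -> 32^3 occupancy grid
    nn.Linear(latent, 256 * 4 * 4 * 4), nn.ReLU(),
    nn.Unflatten(1, (256, 4, 4, 4)),
    nn.ConvTranspose3d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(64, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
)

z = encoder(torch.randn(2, 3, 64, 64))  # codes for unpaired 2D images
p_domain = domain_disc(z)               # adversarial alignment signal
voxels = decoder3d(z)                   # -> (2, 1, 32, 32, 32)
```

Training would alternate between fooling `domain_disc` (so synthetic and real codes become indistinguishable) and supervising `decoder3d` on synthetic image/3D-model pairs, which is what lets the pipeline avoid manually annotated real-image pairs.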
In this paper, we propose a novel approach, 3D-RecGAN++, which reconstructs the complete 3D structure of a given object from a single arbitrary depth view using generative adversarial networks. Unlike existing work, which typically requires multiple views of the same object or class labels to recover the full 3D geometry, the proposed 3D-RecGAN++ takes only the voxel grid representation of a depth view of the object as input, and is able to generate the complete 3D occupancy grid at a high resolution of $256^3$ by recovering the occluded/missing regions. The key idea is to combine the generative capabilities of a 3D encoder-decoder with the conditional adversarial networks framework to infer accurate and fine-grained 3D structures of objects in high-dimensional voxel space. Extensive experiments on large synthetic datasets and real-world Kinect datasets show that the proposed 3D-RecGAN++ significantly outperforms the state of the art in single-view 3D object reconstruction, and is able to reconstruct unseen types of objects.
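The conditional adversarial idea can be summarized in the training objectives below: the generator completes a partial voxel grid, and the discriminator judges (input, output) pairs, so the adversarial signal is conditioned on the observed depth view. This is a sketch of the general conditional-GAN formulation; the weighting `lam`, the use of a per-voxel cross-entropy reconstruction term, and the two-argument discriminator `D(partial, grid)` are assumptions, not 3D-RecGAN++'s published losses.

```python
# Conditional adversarial completion losses (hypothetical G and D;
# a generic sketch rather than the paper's exact objective).
import torch
import torch.nn.functional as F

def generator_loss(G, D, partial, complete_gt, lam=10.0):
    fake = G(partial)                 # completed occupancy grid in (0, 1)
    score = D(partial, fake)          # conditioned on the observed view
    adv = F.binary_cross_entropy(score, torch.ones_like(score))
    rec = F.binary_cross_entropy(fake, complete_gt)  # per-voxel term
    return adv + lam * rec

def discriminator_loss(G, D, partial, complete_gt):
    real_score = D(partial, complete_gt)
    fake_score = D(partial, G(partial).detach())
    real = F.binary_cross_entropy(real_score, torch.ones_like(real_score))
    fake = F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score))
    return real + fake
```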