Unsupervised Learning of Depth and Ego-Motion from Video

Zhou, Tao; Brown, Matthew A.; Snavely, Noah; Lowe, David

doi:10.1109/cvpr.2017.700

Cited by 2,501 publications

(3,352 citation statements)

References 55 publications

Supporting

Mentioning

3,327

Contrasting

Unclassified

Order By: Relevance

“…All the networks are jointly optimized during training, and then they can be applied independently at test time. For instance, [20] learns depth and ego-motion from monocular video in an unsupervised way. The CNNs is trained by a photometric reconstruction loss, which was obtained by warping nearby views to the target using the computed depth and pose.…”

Section: Unsupervised Learning Methodsmentioning

confidence: 99%

DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints

Han¹,

Lin²,

Du³

et al. 2019

2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

View full text Add to dashboard Cite

This paper presents an self-supervised deep learning network for monocular visual inertial odometry (named DeepVIO). DeepVIO provides absolute trajectory estimation by directly merging 2D optical flow feature (OFF) and Inertial Measurement Unit (IMU) data. Specifically, it firstly estimates the depth and dense 3D point cloud of each scene by using stereo sequences, and then obtains 3D geometric constraints including 3D optical flow and 6-DoF pose as supervisory signals. Note that such 3D optical flow shows robustness and accuracy to dynamic objects and textureless environments. In DeepVIO training, 2D optical flow network is constrained by the projection of its corresponding 3D optical flow, and LSTMstyle IMU preintegration network and the fusion network are learned by minimizing the loss functions from ego-motion constraints. Furthermore, we employ an IMU status update scheme to improve IMU pose estimation through updating the additional gyroscope and accelerometer bias. The experimental results on KITTI and EuRoC datasets show that DeepVIO outperforms state-of-the-art learning based methods in terms of accuracy and data adaptability. Compared to the traditional methods, DeepVIO reduces the impacts of inaccurate Camera-IMU calibrations, unsynchronized and missing data.

show abstract

Section: Unsupervised Learning Methodsmentioning

confidence: 99%

DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints

Han¹,

Lin²,

Du³

et al. 2019

2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

View full text Add to dashboard Cite

show abstract

“…We also consider an implementation of a real dataset for fine tuning, using UAVs footages and either a preliminary thorough 3D offline scan or groundtruth-less techniques (Zhou et al, 2017). This would allow us to measure quantitative quality of our network for real footages and not only subjective as for now.…”

Section: Discussionmentioning

confidence: 99%

End-to-End Depth From Motion With Stabilized Monocular Videos

Pinard¹,

Chevalley²,

Manzanera³

et al. 2017

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

View full text Add to dashboard Cite

ABSTRACT:We propose a depth map inference system from monocular videos based on a novel dataset for navigation that mimics aerial footage from gimbal stabilized monocular camera in rigid scenes. Unlike most navigation datasets, the lack of rotation implies an easier structure from motion problem which can be leveraged for different kinds of tasks such as depth inference and obstacle avoidance. We also propose an architecture for end-to-end depth inference with a fully convolutional network. Results show that although tied to camera inner parameters, the problem is locally solvable and leads to good quality depth prediction.

show abstract

“…Godard et al [14] added the Left-Right consistency constraint to the loss function, exploiting another geometrical cue. Zhou et al [43] learned, in addition the ego-motion of the scene, and GeoNet [41] also used the optical flow of the scene. Wang et al [37] recently showed that using direct visual odometry along with depth normalization substantially improves performance on prediction.…”

Section: Related Workmentioning

confidence: 99%

Single Image Depth Estimation Trained via Depth From Defocus Cues

Gur

Wolf

2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

110

View full text Add to dashboard Cite

Estimating depth from a single RGB images is a fundamental task in computer vision, which is most directly solved using supervised deep learning. In the field of unsupervised learning of depth from a single RGB image, depth is not given explicitly. Existing work in the field receives either a stereo pair, a monocular video, or multiple views, and, using losses that are based on structure-from-motion, trains a depth estimation network. In this work, we rely, instead of different views, on depth from focus cues. Learning is based on a novel Point Spread Function convolutional layer, which applies location specific kernels that arise from the Circle-Of-Confusion in each image location. We evaluate our method on data derived from five common datasets for depth estimation and lightfield images, and present results that are on par with supervised methods on KITTI and Make3D datasets and outperform unsupervised learning approaches. Since the phenomenon of depth from defocus is not dataset specific, we hypothesize that learning based on it would overfit less to the specific content in each dataset. Our experiments show that this is indeed the case, and an estimator learned on one dataset using our method provides better results on other datasets, than the directly supervised methods.Our method relies on a novel Point Spread Function (PSF) layer, which preforms a local operation over an image, with a location dependent kernel which is computed "on-the-fly", according to the estimated parameters of the PSF at each location. More specifically, the layer receives three inputs: an all-in-focus image, estimated depth-map and camera parameters, and outputs an image at one specific focus. This image is then compared to the training images to compute a loss. Both the forward and backward operations of the layer are efficiently computed using a dedicated CUDA kernel. This layer is then used as part of a novel architecture, combining the successful ASPP architecture [5,9]. To improve the ASPP block, we add dense connections [16], followed by self-attention [42].We evaluate our method on all relevant benchmarks we were able to obtain. These include the flower lightfield dataset and the multifocus indoor and outdoor scene dataset, for which we compare the ability to generate unseen focus arXiv:2001.05036v1 [cs.CV]

show abstract

Unsupervised Learning of Depth and Ego-Motion from Video

Cited by 2,501 publications

References 55 publications

DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints

DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints

End-to-End Depth From Motion With Stabilized Monocular Videos

Single Image Depth Estimation Trained via Depth From Defocus Cues

Contact Info

Product

Resources

About