Unsupervised Monocular Depth and Ego-Motion Learning With Structure and Semantics

Casser, Vincent; Pirk, Soeren; Mahjourian, Reza; Angelova, Anelia

doi:10.1109/cvprw.2019.00051

Cited by 95 publications

(84 citation statements)

References 21 publications

“…This mask can be obtained from a pretrained segmentation model. Unlike in prior work [7], instance segmentation and tracking are not required, as we need a single "possibly mobile" mask. In fact, we show that a union of bounding boxes is sufficient (see Fig.…”

Section: Learning Object Motion (mentioning)

confidence: 99%

“…Cityscapes. Table 2 summarizes the evaluation metrics of models trained and tested on Cityscapes. We follow the protocol established by previous work, using the disparity for evaluation [7,30]. Since this is a very challenging benchmark with many dynamic objects, very few approaches have evaluated on it.…”

Section: Depth (mentioning)

confidence: 99%

“…We evaluated our egomotion prediction on the KITTI sequences 09 and 10. The common 5-point Absolute Trajectory Error (ATE) metric [50,7,48,13] measures local agreement between the estimated trajectories and the respective ground truth. However, assessing the usefulness of a method for localization requires evaluating its accuracy in predicting location.…”

Section: Odometry (mentioning)

confidence: 99%
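
As a rough illustration of the snippet-level ATE mentioned in the statement above, the sketch below uses one common formulation: express both trajectories relative to their first frame, fit a single least-squares scale factor (monocular predictions carry no absolute scale), and report the root-mean-square position error. The function name `snippet_ate` and the toy trajectories are hypothetical, and the cited papers may normalize or align differently.

```python
import math

def snippet_ate(gt_xyz, pred_xyz):
    """RMSE between ground-truth and scale-aligned predicted camera positions.

    gt_xyz, pred_xyz: equal-length lists of (x, y, z) positions for one short
    snippet (e.g. 5 frames). Assumed formulation, not a reference implementation.
    """
    # Express both trajectories relative to their own first frame.
    gt = [[g[i] - gt_xyz[0][i] for i in range(3)] for g in gt_xyz]
    pred = [[p[i] - pred_xyz[0][i] for i in range(3)] for p in pred_xyz]
    # Least-squares scale s minimizing sum ||gt - s * pred||^2.
    num = sum(g[i] * p[i] for g, p in zip(gt, pred) for i in range(3))
    den = sum(p[i] ** 2 for p in pred for i in range(3))
    scale = num / den if den > 0 else 1.0
    # Root-mean-square error over the frames of the snippet.
    sq_err = sum((g[i] - scale * p[i]) ** 2
                 for g, p in zip(gt, pred) for i in range(3))
    return math.sqrt(sq_err / len(gt))

# Toy snippet: the prediction is the ground truth at half scale, shifted by (1, 1, 1).
gt = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
pred = [(1, 1, 1), (1.5, 1, 1), (2, 1, 1), (2.5, 1, 1), (3, 1, 1)]
ate = snippet_ate(gt, pred)
```

Because the offset and scale are fitted per snippet, a low value here says little about long-range drift, which is the limitation the quoted statement goes on to raise.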

“…Seq. 10
Metric             ATE      t_rel     ATE      t_rel
Zhou [50]          0.021    17.84%    0.020    37.91%
GeoNet [48]        0.012    /         0.012    /
Zhan [49]          /        11.92%    /        12.45%
Mahjourian [25]    0.013    /         0.012    /
Struct2depth [7]   0.
Table 6: Absolute Trajectory Error (ATE) [50] and average relative translational drift (t_rel) [33] on the 09 and 10 KITTI odometry sequences.…”

Section: Odometry (mentioning)

confidence: 99%

“…Second, we are the first in this context to address occlusions directly, in a geometric way, from the predicted depth as it is. Lastly, we substantially reduce the amount of semantic understanding needed to address moving elements in the scene: Instead of segmenting every instance of a moving object and tracking it across frames [7], we need a single mask that covers pixels that could belong to a moving object. This mask can be very rough, and in fact can be a union of rectangular bounding boxes.…”

Section: Introduction (mentioning)

confidence: 99%
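
The "union of rectangular bounding boxes" mask described in the statement above can be sketched in a few lines. This is only an illustration under assumed conventions: boxes are `(x_min, y_min, x_max, y_max)` in pixels with exclusive upper bounds, and `union_box_mask` is a hypothetical helper, not code from either paper.

```python
def union_box_mask(height, width, boxes):
    """Return a height x width grid of 0/1 values.

    A pixel is 1 if any box covers it, so the result is a single rough
    "possibly mobile" mask with no per-instance labels or tracking.
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image so out-of-frame detections are tolerated.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask

# Two overlapping detections on a 4 x 6 image.
mask = union_box_mask(4, 6, [(0, 0, 2, 2), (1, 1, 4, 3)])
covered = sum(sum(row) for row in mask)
```

Overlapping boxes simply merge, which is why no instance identities are needed.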

Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras

Gordon, Jonschkowski, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

We present a novel method for simultaneous learning of depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as supervision signal. Similarly to prior work, our method learns by applying differentiable warping to frames and comparing the result to adjacent ones, but it provides several improvements: We address occlusions geometrically and differentiably, directly using the depth maps as predicted during training. We introduce randomized layer normalization, a novel powerful regularizer, and we account for object motion relative to the scene. To the best of our knowledge, our work is the first to learn the camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing us to extract accurate depth and motion from arbitrary videos of unknown origin at scale. We evaluate our results on the Cityscapes, KITTI and EuRoC datasets, establishing new state of the art on depth prediction and odometry, and demonstrate qualitatively that depth prediction can be learned from a collection of YouTube videos.
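
To make the warping step in the abstract more concrete, the sketch below reprojects a single pixel from one frame into a neighboring frame given a depth value, pinhole intrinsics, and a relative camera motion. It is a simplified illustration under stated assumptions: an ideal pinhole camera with no lens distortion, no bilinear sampling, and no occlusion or object-motion handling. The name `reproject_pixel` and all parameter values are hypothetical and do not come from the paper.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map pixel (u, v) with the given depth into a neighboring frame.

    R is a 3x3 rotation (nested lists) and t a 3-vector taking points from
    the source camera frame to the target camera frame.
    Returns (u', v'), or None if the point falls behind the target camera.
    """
    # Back-project the pixel to a 3D point in the source camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Apply the rigid camera motion.
    X = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    Y = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    Z = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    if Z <= 0:
        return None
    # Project back onto the target image plane.
    return fx * X / Z + cx, fy * Y / Z + cy

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# A point 10 units away, shifted 1 unit along x in the target frame.
warped = reproject_pixel(50, 40, 10.0, 100.0, 100.0, 50.0, 40.0,
                         identity, [1.0, 0.0, 0.0])
```

In a training setup of the kind the abstract describes, such coordinates would be computed for every pixel and used to sample the neighboring frame, with the photometric difference between the warped and actual frames serving as the loss; every quantity in the chain (depth, motion, intrinsics) then receives gradients.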
