Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos

Casser, Vincent; Pirk, Soeren; Mahjourian, Reza; Angelova, Anelia

doi:10.48550/arxiv.1811.06152

Cited by 8 publications

(18 citation statements)

References 0 publications

“…We propose a self-supervised learning framework inspired by their work, with significant modifications summarized as follows: (1) Our system estimates the combined transformation, encapsulating both the camera ego-motion and the object motion. By contrast, Casser et al. [3] estimate the object motion on top of the camera ego-motion, predicted by the Pose-net. Thus the accuracy of their object motion prediction is dependent on the performance of their Pose-net.…”

Section: Supervised Depth Estimation (mentioning)

confidence: 99%

“…In this work we try to solve the object motion by modelling it as a rigid-body transform. A similar idea is proposed in [3], where pre-computed instance segmentation masks are utilized for individual object-motion prediction. We propose a self-supervised learning framework inspired by their work, with significant modifications summarized as follows: (1) Our system estimates the combined transformation, encapsulating both the camera ego-motion and the object motion.…”

Section: Supervised Depth Estimation (mentioning)

confidence: 99%

“…Abs Rel, Sq Rel, RMSE, RMSE log, δ < 1.25, δ < 1.25², δ < 1.25³. Bad pixel percentage of disparity prediction, evaluated on KITTI stereo split.…”

mentioning

confidence: 99%
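
The column names in the statement above match the error and accuracy metrics commonly reported for monocular depth estimation. The quoted text does not give the formulas, so the following is only a minimal sketch using the definitions that are standard in the depth-estimation literature; the function name `depth_metrics` and the flat-list input format are assumptions made for illustration.

```python
import math


def depth_metrics(pred, gt):
    """Compute common depth metrics for paired positive depth values.

    Hypothetical helper: `pred` and `gt` are flat sequences of predicted and
    ground-truth depths (same length, all values > 0).
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(gt)
    pairs = list(zip(pred, gt))

    # Mean absolute and squared errors, each normalized by ground-truth depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n

    # Root mean squared error in linear and log space.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )

    # Accuracy: fraction of points whose ratio max(p/g, g/p) is below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)}

    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "delta_1": deltas[1],
        "delta_2": deltas[2],
        "delta_3": deltas[3],
    }
```

Lower is better for the four error metrics, and higher is better for the three δ accuracies. Published evaluations typically also apply depth caps and crops, which this sketch omits.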

Self-supervised Object Motion and Depth Estimation from Video

Dai

Patil

Hecker

et al. 2019

Preprint

We present a self-supervised learning framework to estimate individual object motion and monocular depth from video. We model the object motion as a 6 degree-of-freedom rigid-body transformation. The instance segmentation mask is leveraged to introduce object information. Compared with methods that predict a pixel-wise optical flow map to model the motion, our approach significantly reduces the number of values to be estimated. Furthermore, our system eliminates the scale ambiguity of predictions by employing the pre-computed camera ego-motion and the left-right photometric consistency. Experiments on the KITTI driving dataset demonstrate that our system is capable of capturing object motion without external annotation and contributes to depth prediction in dynamic areas. Our system outperforms earlier self-supervised approaches in terms of 3D scene flow prediction, and produces comparable results on optical flow estimation.

“…In these systems depth information can be used to decide whether to accelerate, brake or steer. Sonar, radar, and lidar¹ are examples of technologies that can be used to measure this information directly. As a complementary source of information or as a cost-effective alternative, depth can be predicted from camera data.…”

Section: Introduction (mentioning)

confidence: 99%

“…[11] showed that it is possible to train depth prediction models on video data using image reconstruction as a supervision signal, by adding a parallel network that predicts the image-pair camera transformation that is required for image reconstruction. The reconstruction computation will be discussed in further detail in the methods section. [Footnote 1: Radar that uses laser instead of radio waves.]…”

Section: Introduction (mentioning)

confidence: 99%

Improving Self-Supervised Single View Depth Estimation by Masking Occlusion

Schellevis

2019

Preprint

Single view depth estimation models can be trained from video footage using a self-supervised end-to-end approach with view synthesis as the supervisory signal. This is achieved with a framework that predicts depth and camera motion, with a loss based on reconstructing a target video frame from temporally adjacent frames. In this context, occlusion relates to parts of a scene that can be observed in the target frame but not in a frame used for image reconstruction. Since the image reconstruction is based on sampling from the adjacent frame, and occluded areas by definition cannot be sampled, reconstructed occluded areas corrupt the supervisory signal. In previous work [6] occlusion is handled based on reconstruction error; at each pixel location, only the reconstruction with the lowest error is included in the loss. The current study aims to determine whether the performance of depth estimation models can be improved by ignoring, during training, only those regions that are affected by occlusion. In this work we introduce the occlusion mask, a mask that can be used during training to specifically ignore regions that cannot be reconstructed due to occlusions. The occlusion mask is based entirely on predicted depth information. We introduce two novel loss formulations which incorporate the occlusion mask. The method and implementation of [6] serve as the foundation for our modifications as well as the baseline in our experiments. We demonstrate that (i) incorporating the occlusion mask in the loss function improves the performance of single image depth prediction models on the KITTI benchmark, and (ii) loss functions that select from reconstructions based on error are able to ignore some of the reprojection error caused by object motion.
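
The abstract describes the prior approach of [6] as keeping, at each pixel location, only the reconstruction with the lowest error. A minimal sketch of that selection rule follows, assuming the per-pixel reconstruction errors for each adjacent frame have already been computed; the function name `min_reconstruction_loss` and the flat-list layout are hypothetical and not taken from either paper's implementation.

```python
def min_reconstruction_loss(error_maps):
    """Average, over pixels, the lowest reconstruction error across frames.

    Hypothetical helper: `error_maps` holds one flat list of per-pixel errors
    per adjacent (source) frame; all lists must have the same length.
    """
    if not error_maps or not error_maps[0]:
        raise ValueError("error_maps must contain at least one non-empty map")
    if any(len(m) != len(error_maps[0]) for m in error_maps):
        raise ValueError("all error maps must cover the same pixels")

    # For each pixel, keep only the reconstruction with the lowest error.
    per_pixel_min = [min(errors) for errors in zip(*error_maps)]
    return sum(per_pixel_min) / len(per_pixel_min)


# Two source frames, three pixels: each pixel takes its smaller error.
loss = min_reconstruction_loss([[0.1, 0.9, 0.4], [0.5, 0.2, 0.4]])
```

A pixel that is occluded in one adjacent frame but visible in the other tends to have a high error in the first and a low error in the second, so taking the minimum discards the occluded reconstruction without an explicit mask. The occlusion mask proposed in the abstract is instead derived from predicted depth, which this sketch does not attempt to reproduce.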

Inpainting Semantic and Depth Features to Improve Visual Place Recognition in the Wild

Semenkov

Karpov

Savchenko

et al. 2024

IEEE Access

Visual place recognition is one of the core modern computer vision tasks, concerned with identifying a location based on an image taken there. Modern state-of-the-art approaches rely heavily on RGB images, which are strongly affected by changes in the same scene such as varying daytime, illumination, seasonal changes, and the presence of dynamic objects (people, vehicles). This results in a large difference between the images in the training dataset and the ones taken by a person in real life at the same place as a part of some application, rendering modern approaches less effective. To deal with this problem, we propose a novel approach that uses only geometrical information (shapes of buildings, terrains, trees, and their relevant positions) obtained from depth and semantic maps inpainted to remove dynamic objects. In this paper, we study two versions of the pipeline: the first one uses direct inpainting, and the second utilizes synthetic data to improve the inpainting process. Our most efficient model achieved 60.6% correct answers with synthetic refinement. With direct inpainting, it kept metrics high at 51.1%. With these compelling results, our approach offers a novel and effective alternative to known algorithms, making it an exciting avenue for future research in visual place recognition.
