Flow-Motion and Depth Network for Monocular Stereo and Beyond

Wang, Kaixuan; Shen, Shaojie

doi:10.1109/lra.2020.2975750

Cited by 18 publications (11 citation statements)

References 25 publications (22 reference statements)

“…Recently, realistic simulators were created for driving scenes, such as CARLA [224], Nvidia Drive Sim, and indoor scenes, such as Habitat [225; 226]. Despite the usage of simulators, other datasets rely on game engines or general computer graphics engines to build their systems, such as SYNTHIA [99], Virtual KITTI [87], and Virtual KITTI 2 [88] that used Unity as graphic engine, and GTA-SfM [142] that uses scenes from the game GTAV.…”

Section: Discussion (mentioning)

confidence: 99%

A Survey on RGB-D Datasets

Lopes, Souza, Pedrini, 2022 (preprint)

RGB-D data is essential for solving many problems in computer vision. Hundreds of public RGB-D datasets containing various scenes, such as indoor, outdoor, aerial, driving, and medical, have been proposed. These datasets are useful for different applications and are fundamental for addressing classic computer vision tasks, such as monocular depth estimation. This paper reviewed and categorized image datasets that include depth information. We gathered 203 datasets that contain accessible data and grouped them into three categories: scene/objects, body, and medical. We also provided an overview of the different types of sensors, depth applications, and we examined trends and future directions of the usage and creation of datasets containing depth data, and how they can be applied to investigate the development of generalizable machine learning models in the monocular depth estimation field.

“…GTA-SFM [34]: GTA-SFM is a synthetic dataset rendered from GTA-V, an open-world game with large-scale city models. It contains 200 scenes for training and 19 scenes for testing.…”

Section: Datasets (mentioning)

confidence: 99%

“…In default, we utilize CasMVSNet [3] as the backbone network. The split of train, valid and test sets in each dataset follows the official configuration in DTU [4], BlendedMVS [33] and GTA-SFM [34]. Since the semi-supervised MVS problem in this paper aims to remedy the urge for large-scale MVS data, we only use limited annotated ground truth during training.…”

Section: Implementation Details (mentioning)

confidence: 99%

Semi-supervised Deep Multi-view Stereo

Xu, Zhou, Chen et al., 2022 (preprint)

Significant progress has been witnessed in learning-based Multi-view Stereo (MVS) under supervised and unsupervised settings. To combine their respective merits in accuracy and completeness while reducing the demand for expensive labeled data, this paper explores a novel semi-supervised setting of the learning-based MVS problem in which only a tiny part of the MVS data is attached with dense depth ground truth. However, due to the huge variation of scenarios and the flexible setting of views, the semi-supervised MVS problem (Semi-MVS) may break the basic assumption in classic semi-supervised learning that unlabeled data and labeled data share the same label space and data distribution. To handle these issues, we propose a novel semi-supervised MVS framework, namely SE-MVS. For the simple case in which the basic assumption holds for the MVS data, consistency regularization encourages the model predictions to be consistent between the original sample and a randomly augmented sample via constraints on KL divergence. For the more troublesome case in which the basic assumption is violated in the MVS data, we propose a novel style consistency loss to alleviate the negative effect caused by the distribution gap. The visual style of an unlabeled sample is transferred to a labeled sample to shrink the gap, and the model prediction on the generated sample is further supervised with the label of the original labeled sample. The experimental results on the DTU, BlendedMVS, GTA-SFM, and Tanks&Temples datasets show the superior performance of the proposed method. With the same backbone network settings, our proposed SE-MVS outperforms its fully-supervised and unsupervised baselines.

“…The first predicts the optical flow, whereas the second takes this prediction into consideration while inferring depth maps and surface normals. A comparable approach is presented by Wang et al (2020) [25] where the network first jointly estimates optical flow and camera motion. A triangulation layer is then proposed to encode this information and, finally, a depth map is estimated.…”

Section: B Estimating Depth From Motion (mentioning)

confidence: 99%
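
The citation statement above describes a pipeline in which optical flow and camera motion are estimated first and then combined by triangulation before depth is predicted. As a rough illustration only, the sketch below performs classical two-view triangulation for a single pixel. It is not the triangulation layer of the cited paper; the pinhole camera model, the function names, and the parameters `focal` and `center` are assumptions made for this example.

```python
# Minimal sketch (assumed pinhole model, not the paper's triangulation layer):
# recover the depth of one pixel from its optical flow and the relative camera
# motion (R, t), where a point X1 in camera 1 maps to X2 = R @ X1 + t in camera 2.

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(R, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(dot(row, v) for row in R)

def backproject(pixel, focal, center):
    """Turn a pixel into a viewing ray with unit z-component."""
    u, v = pixel
    return ((u - center[0]) / focal, (v - center[1]) / focal, 1.0)

def triangulate_depth(pixel, flow, R, t, focal, center):
    """Least-squares depth of `pixel` in camera 1, or None if degenerate.

    From d2 * r2 = d1 * (R r1) + t, taking the cross product with r2 gives
    d1 * (r2 x R r1) + (r2 x t) = 0, which is solved for d1.
    """
    r1 = backproject(pixel, focal, center)
    matched = (pixel[0] + flow[0], pixel[1] + flow[1])  # flow gives the match
    r2 = backproject(matched, focal, center)
    a = cross(r2, matvec(R, r1))
    b = cross(r2, t)
    denom = dot(a, a)
    if denom < 1e-12:  # no parallax: depth is unobservable for this pixel
        return None
    return -dot(a, b) / denom

IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# A point at (0.5, -0.2, 4.0) in camera 1, seen again after X2 = X1 + (-0.3, 0, 0).
depth = triangulate_depth(pixel=(382.5, 215.0), flow=(-37.5, 0.0),
                          R=IDENTITY, t=(-0.3, 0.0, 0.0),
                          focal=500.0, center=(320.0, 240.0))
```

With zero flow and a pure sideways translation the function returns `None`, which reflects the fact that depth cannot be triangulated without parallax.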

Exploiting Motion Perception in Depth Estimation Through a Lightweight Convolutional Neural Network

Leite, Pinto, 2021, IEEE Access

Understanding the surrounding 3D scene is of the utmost importance for many robotic applications. The rapid evolution of machine learning techniques has enabled impressive results when depth is extracted from a single image. High-latency networks are required to achieve these performances, rendering them unusable for time-constrained applications. This article introduces a lightweight Convolutional Neural Network (CNN) for depth estimation, NEON, designed for balancing both accuracy and inference times. Instead of solely focusing on visual features, the proposed methodology exploits the Motion-Parallax effect to combine the apparent motion of pixels with texture. This research demonstrates that motion perception provides crucial insight about the magnitude of movement for each pixel, which also encodes cues about depth since large displacements usually occur when objects are closer to the imaging sensor. NEON's performance is compared to relevant networks in terms of Root Mean Squared Error (RMSE), the percentage of correctly predicted pixels (δ₁) and inference times, using the KITTI dataset. Experiments prove that NEON is significantly more efficient than the current top ranked network, estimating predictions 12 times faster while achieving an average RMSE of 3.118 m and a δ₁ of 94.5%. Ablation studies demonstrate the relevance of tailoring the network to use motion perception principles in estimating depth from image sequences, considering that the effectiveness and quality of the estimated depth map is similar to more computationally demanding state-of-the-art networks. Therefore, this research proposes a network that can be integrated in robotic applications, where computational resources and processing-times are important constraints, enabling tasks such as obstacle avoidance, object recognition and robotic grasping.
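
The abstract above reports RMSE and the threshold accuracy δ₁. The sketch below shows how these two metrics are commonly computed for depth maps. It assumes the conventional definition of δ₁ as the fraction of valid pixels with max(pred/gt, gt/pred) < 1.25; the abstract itself does not state the threshold, and the function names and the rule that ground truth of 0 marks an invalid pixel are also assumptions for this example.

```python
import math

# Minimal sketch of two common depth-evaluation metrics over flat lists of
# per-pixel depths. Assumption: a ground-truth value <= 0 marks a pixel with
# no measurement, which is skipped.

def rmse(pred, gt):
    """Root mean squared error over pixels with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose ratio max(p/g, g/p) is below `threshold`.

    threshold=1.25 is the conventional delta_1 setting (an assumption here).
    Non-positive predictions are counted as misses.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)

# Toy data in metres: the last pixel has no ground truth and is ignored.
predicted = [10.0, 4.5, 2.0, 8.0]
measured = [10.0, 4.0, 4.0, 0.0]

error = rmse(predicted, measured)               # sqrt(4.25 / 3)
accuracy = delta_accuracy(predicted, measured)  # 2 of 3 valid pixels pass
```

A lower RMSE and a higher δ₁ both indicate better predictions, so the two numbers are usually read together.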
