MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

Zhao, Chaoqiang; Zhang, Youmin; Poggi, Matteo; Tosi, Fabio; Guo, Xianda; Zhu, Zheng; Guan, Huaijin; Yang, Tao; Mattoccia, Stefano

doi:10.48550/arxiv.2208.03543

Cited by 3 publications

(8 citation statements)

References 59 publications

“…Our accuracy is superior to that of the newly proposed approaches, such as MonoFormer [64], CADepth [21], and DIFFNet [23] in all metrics. We also contrast the most advanced approach currently available, MonoViT [22]. Our results on Abs Rel and δ₁ are comparable, but we perform better on other measures, particularly Sq Rel.…”

Section: Quantitative Evaluation (mentioning)

confidence: 69%
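
The metrics named in the statement above (Abs Rel, Sq Rel, δ₁) follow the usual depth-evaluation definitions. Below is a minimal sketch of how they are commonly computed, assuming flat sequences of predicted and ground-truth depths in the same units; the function name `depth_metrics` and the rule of skipping non-positive values are illustrative assumptions, not taken from any of the cited papers.

```python
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Compute Abs Rel, Sq Rel and delta_1 over pixels with positive depths.

    Hypothetical helper: the validity rule (skip non-positive values) is an
    assumption made for this sketch.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs")
    n = len(pairs)
    # Abs Rel: mean of |pred - gt| / gt
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Sq Rel: mean of (pred - gt)^2 / gt
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    # delta_1: fraction of pixels with max(pred/gt, gt/pred) < 1.25
    delta_1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "delta_1": delta_1}


if __name__ == "__main__":
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.5, 2.0]))
```

Lower is better for Abs Rel and Sq Rel, higher is better for δ₁. Sq Rel squares the error before normalizing, so it penalizes large outliers more heavily than Abs Rel does.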

“…Furthermore, Godard et al [6] proposed a classical method Monodepth2, and they adopted an automasking scheme to filter out invalid pixels from moving objects and introduced a minimum reprojection loss to address occlusions. Based on Monodepth2, numerous current self-supervised monocular depth estimation approaches [21][22][23] are further researched. Liu et al [24] proposed a domain-separated network for self-supervised depth estimation of all-day images.…”

Section: B Self-supervised Monocular Depth Estimation (mentioning)

confidence: 99%
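
The statement above mentions two Monodepth2 ideas: a minimum reprojection loss and automasking. The following is a rough sketch of how the two are usually described, assuming per-pixel photometric errors have already been computed; the function name, the list-of-lists layout, and averaging over unmasked pixels are assumptions for illustration, not the implementation of any cited paper.

```python
from typing import List, Tuple


def min_reprojection_with_automask(
    warped_errors: List[List[float]],
    identity_errors: List[List[float]],
) -> Tuple[float, List[int]]:
    """Combine per-source photometric errors into one masked loss.

    warped_errors[s][i]:   error at pixel i between the target frame and
                           source frame s warped into the target view.
    identity_errors[s][i]: error at pixel i between the target frame and
                           the unwarped source frame s.
    Returns (mean loss over kept pixels, binary mask).
    """
    # Minimum reprojection: per pixel, keep the best-matching source frame,
    # which reduces the penalty for pixels occluded in one of the sources.
    min_warped = [min(col) for col in zip(*warped_errors)]
    min_identity = [min(col) for col in zip(*identity_errors)]
    # Automask: keep a pixel only if warping explains it better than the
    # unwarped source does (filters static scenes and co-moving objects).
    mask = [1 if w < i else 0 for w, i in zip(min_warped, min_identity)]
    kept = sum(mask)
    # Averaging over kept pixels is a choice made for this sketch.
    loss = sum(w * m for w, m in zip(min_warped, mask)) / max(kept, 1)
    return loss, mask


if __name__ == "__main__":
    warped = [[0.2, 0.5, 0.9], [0.4, 0.1, 0.8]]
    identity = [[0.3, 0.3, 0.05], [0.6, 0.2, 0.1]]
    print(min_reprojection_with_automask(warped, identity))
```

In the example, the third pixel is masked out because the unwarped source already matches the target better than any warped source, which is the typical signature of an object moving with the camera.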

“…with λ set to 10⁻³ and γ set to 0.1. Similar to previous works [6,22], we apply a per-pixel binary mask, i.e. μ ∈ {0, 1}, which is formulated as:…”

Section: Knowledge Distillation (mentioning)

confidence: 99%

“…The Make3D dataset is an outdoor dataset with a scene similar to KITTI with a fixed image size of 1704×2272, containing a training set of 400 image-depth pairs and a test set of 134 image-depth pairs, which is generally used as a generalization test for monocular depth estimation. Following previous preprocessing [6,22] on a center crop with a 2×1 ratio, we test the performance of different solutions [6,22,45].…”

Section: B Datasets (mentioning)

confidence: 99%
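
The statement above refers to a center crop with a 2×1 ratio on Make3D images. A minimal sketch of computing such a crop window follows, assuming the ratio means width:height = 2:1 and that the example image has 2272 rows and 1704 columns; the function name `center_crop_box` and the integer rounding are illustrative assumptions.

```python
from typing import Tuple


def center_crop_box(height: int, width: int) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) of the largest centered window
    whose width:height ratio is 2:1. Bounds are half-open, so the crop is
    rows[top:bottom] and columns[left:right].
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    crop_h = width // 2
    if crop_h <= height:
        # Image is too tall for 2:1: keep full width, trim rows.
        top = (height - crop_h) // 2
        return top, top + crop_h, 0, width
    # Image is too wide for 2:1: keep full height, trim columns.
    crop_w = 2 * height
    left = (width - crop_w) // 2
    return 0, height, left, left + crop_w


if __name__ == "__main__":
    top, bottom, left, right = center_crop_box(2272, 1704)
    print(top, bottom, left, right)
```

With a NumPy-style array, the result would be applied as `image[top:bottom, left:right]`; the same window should be applied to the ground-truth depth map so that prediction and target stay aligned.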

“…We indicate with the white dotted box where our method performs better than other methods. For example, in the visualization results, it can be found that MonoViT [22] does not perform well in the depth estimation of car mirrors, and Monodepth2 [6] does not perform well in overlapping pedestrian occlusion areas.…”

Section: Quantitative Evaluation (mentioning)

confidence: 99%

Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement

Liu, Li, Shao et al. 2023

Preprint

Monocular depth estimation plays a fundamental role in computer vision. Due to the costly acquisition of depth ground truth, self-supervised methods that leverage adjacent frames to establish a supervisory signal have emerged as the most promising paradigms. In this work, we propose two novel ideas to improve self-supervised monocular depth estimation: 1) self-reference distillation and 2) disparity offset refinement. Specifically, we use a parameter-optimized model as the teacher updated as the training epochs to provide additional supervision during the training process. The teacher model has the same structure as the student model, with weights inherited from the historical student model. In addition, a multiview check is introduced to filter out the outliers produced by the teacher model. Furthermore, we leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets, which are used to refine the disparity output incrementally by aligning disparity information at different scales. The experimental results on the KITTI and Make3D datasets show that our method outperforms previous state-of-the-art competitors.

REAL-NET: A Monochromatic Depth Estimation Using REgional Attention and Local Feature Mapping

Bhandari, Palit 2024

Lecture Notes in Computer Science

From a Visual Scene to a Virtual Representation: A Cross-Domain Review

et al. 2023

The widespread use of smartphones and other low-cost equipment as recording devices, the massive growth in bandwidth, and the ever-growing demand for new applications with enhanced capabilities, made visual data a must in several scenarios, including surveillance, sports, retail, entertainment, and intelligent vehicles. Despite significant advances in analyzing and extracting data from images and video, there is a lack of solutions able to analyze and semantically describe the information in the visual scene so that it can be efficiently used and repurposed. Scientific contributions have focused on individual aspects or addressing specific problems and application areas, and no cross-domain solution is available to implement a complete system that enables information passing between cross-cutting algorithms. This paper analyses the problem from an end-to-end perspective, i.e., from the visual scene analysis to the representation of information in a virtual environment, including how the extracted data can be described and stored. A simple processing pipeline is introduced to set up a structure for discussing challenges and opportunities in different steps of the entire process, allowing to identify current gaps in the literature. The work reviews various technologies specifically from the perspective of their applicability to an end-to-end pipeline for scene analysis and synthesis, along with an extensive analysis of datasets for relevant tasks.
