2021
DOI: 10.1609/aaai.v35i3.16329

HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation

Abstract: Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use high-resolution images for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, which makes the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accu…
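To make the interpolation argument in the abstract concrete, here is a minimal sketch, not taken from the paper: the depth values, tensor shapes, and scale factor are invented for illustration. It shows how upsampling a coarse inverse-depth prediction (linear interpolation in 1-D, the analogue of the bilinear case) produces blended values only where the depth gradient is large, while flat regions are reproduced exactly.

```python
import torch
import torch.nn.functional as F

# Toy 1-D depth profile with a sharp edge: foreground at 2 m, background at 10 m.
# Values and shapes are illustrative only, not taken from HR-Depth.
low_res = torch.tensor([[[2.0, 2.0, 2.0, 10.0, 10.0, 10.0]]])  # (N, C, W)

# Upsample 4x with linear interpolation (1-D analogue of bilinear upsampling).
full_res = F.interpolate(low_res, scale_factor=4, mode="linear", align_corners=False)
print(full_res.squeeze())

# Away from the edge the upsampled values equal the true depths (2 m or 10 m),
# but around the edge they are intermediate depths (here 3, 5, 7 and 9 m) that
# belong to neither surface: the interpolation error is concentrated in the
# large gradient region. Predicting at a higher resolution leaves fewer
# interpolated samples near each edge, so this particular error shrinks even
# if the underlying estimate does not improve.
```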

Cited by 161 publications (71 citation statements)
References 25 publications
“…Thus, to achieve good performance, this formulation requires the network to accurately perceive the scene structure: a challenging task, especially in regions where it is hard to distinguish foreground objects from the background. Current SoTA networks [57,30] rely on traditional convolutional layers to aggregate context information and gradually enlarge the receptive field of the network through a cascade of layers and strided convolutions [40]. However, given the intrinsic locality of the convolution operator, CNNs can hardly model long-range appearance similarity among objects, in particular within the shallowest features.…”
Section: Motivations (mentioning)
confidence: 99%
“…Considering the context difference between features at different scales, e.g. higher-resolution features favour fine-grained details, we enhance cross-scale feature fusion with both spatial and channel attention mechanisms [30,69] (i.e., our Atten Block). Finally, four heads, each made of two convolutional layers and a Sigmoid activation, are in charge of disparity (inverse depth) prediction from the corresponding aggregated features, outputting maps at full, 1/2, 1/4 and 1/8 resolution respectively.…”
Section: DepthNet Architecture (mentioning)
confidence: 99%
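The head design described in the excerpt above lends itself to a compact sketch. The following is a minimal, assumption-laden illustration, not the authors' implementation: the channel widths, the ELU between the two convolutions, and the 640x192 input resolution are invented for the example. It shows one disparity head (two convolutions followed by a Sigmoid) applied to aggregated features at four decoder scales, producing maps at full, 1/2, 1/4 and 1/8 resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DispHead(nn.Module):
    """Sketch of one prediction head: two convolutions followed by a Sigmoid,
    mapping aggregated decoder features to a single-channel disparity
    (inverse depth) map in (0, 1). Channel widths and the ELU between the
    convolutions are assumptions for this example."""

    def __init__(self, in_channels: int, hidden_channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden_channels, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.conv2(F.elu(self.conv1(x))))


# Four heads attached to aggregated features at four decoder scales,
# yielding disparity maps at full, 1/2, 1/4 and 1/8 resolution.
widths = (16, 32, 64, 128)   # assumed feature widths per scale
strides = (1, 2, 4, 8)       # downsampling factor of each scale
heads = nn.ModuleList([DispHead(c) for c in widths])

feats = [torch.randn(1, c, 192 // s, 640 // s) for c, s in zip(widths, strides)]
disparities = [head(f) for head, f in zip(heads, feats)]
for disp, s in zip(disparities, strides):
    print(f"scale 1/{s} disparity shape:", tuple(disp.shape))
```

In many self-supervised pipelines the Sigmoid output is later rescaled to a bounded depth range before view synthesis; that step is outside the quoted description and is omitted here.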