HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation

Lyu, Xiaoyang; Liu, Liang; Wang, Mengmeng; Kong, Xin; Liu, Lina; Liu, Yong; Chen, Xinxin; Yuan, Yi

doi:10.1609/aaai.v35i3.16329

Cited by 161 publications

(71 citation statements)

References 25 publications


“…Thus, to achieve good performance, this formulation requires the network to accurately perceive the scene structure: a challenging task, especially for regions with hard to distinguish foreground objects from the background. Current SoTA networks [57,30] rely on traditional convolutional layers for aggregating context information and gradually lift the receptive field of the network through a cascade of layers and strided convolution [40]. However, given the intrinsic locality of the convolution operator, CNNs hardly model long-range appearance similarity among objects, in particular within the shallowest features.…”

Section: Motivations (mentioning)

confidence: 99%

“…Considering the context difference between features at different scales, e.g. higher resolution features favour fine-grained details, we enhance cross-scale feature fusion with both spatial and channel attention mechanisms [30,69] (i.e., our Atten Block). Finally, four heads - made of two convolutional layers and a Sigmoid activation - are in charge of disparity (inverse depth) prediction from corresponding aggregated features, outputting maps at full, 1/2, 1/4, 1/8 resolution respectively.…”

Section: Depthnet Architecture (mentioning)

confidence: 99%

“…Following [17,30,69,57], our PoseNet favors a simple, yet effective implementation. Specifically, our PoseNet uses the lightweight structure of ResNet18 [20].…”

Section: Posenet (mentioning)

confidence: 99%

“…Overall, network training requires about 15 hours. In our experiments, we adopt the same data augmentation detailed in [17,30].…”

Section: Implementation Details (mentioning)

confidence: 99%

“…Different kinds of backbone, such as VGGNet, ResNet, HRNet and PackNet, made their way into the self-supervised monocular depth estimation task [71,17,69,18]. Moreover, to improve the feature extraction and processing ability, new frameworks like HR-Depth [30] and CADepth [57] also introduced attention modules. However, we argue that a shared shortcoming of existing self-supervised models falls in the reduced receptive field of Convolutional Neural Networks (CNNs).…”

Section: Introduction (mentioning)

confidence: 99%


MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

Zhao, Zhang, Poggi et al. 2022

Preprint
