2023
DOI: 10.1007/s11633-023-1458-0

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Zhenyu Li,
Zehui Chen,
Xianming Liu
et al.

Abstract: This paper aims to address the problem of supervised monocular depth estimation. We start with a meticulous pilot study to demonstrate that long-range correlation is essential for accurate depth estimation. Moreover, the Transformer and convolution are good at long-range and close-range depth estimation, respectively. Therefore, we propose to adopt a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former can model global context with the effective attention mechanism…
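To make the parallel-encoder idea in the abstract concrete, the following is a minimal PyTorch sketch: a convolution branch keeps local, close-range detail while a Transformer branch models long-range correlation over patch tokens, and the two feature maps are fused. All module names, sizes, and the additive fusion are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class ParallelEncoder(nn.Module):
    """Hypothetical sketch of a Transformer + convolution parallel encoder."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        # Convolution branch: local, close-range features at full resolution.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )
        # Transformer branch: global context via self-attention over patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, x):
        local_feat = self.conv_branch(x)              # (B, C, H, W)
        tokens = self.patch_embed(x)                  # (B, C, H/8, W/8)
        b, c, h, w = tokens.shape
        global_feat = self.attn(tokens.flatten(2).transpose(1, 2))
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        # Upsample the coarse global features and fuse with the local ones
        # (simple addition here; the real interaction is more elaborate).
        global_feat = nn.functional.interpolate(
            global_feat, size=local_feat.shape[-2:], mode="bilinear",
            align_corners=False)
        return local_feat + global_feat

x = torch.randn(1, 3, 64, 64)
print(ParallelEncoder()(x).shape)  # torch.Size([1, 64, 64, 64])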

Citation types: 0 supporting, 7 mentioning, 0 contrasting

Cited by 76 publications (7 citation statements)
References 51 publications
“…The authors also integrate a revisited version of the CutDepth data augmentation method [27], which improves training on the NYU Depth v2 dataset without requiring additional data. Li et al. propose DepthFormer [6] and BinsFormer [28]: the former pairs a fully-Transformer encoder with a convolutional decoder, interleaved by an interaction module that enhances the Transformer-encoded and CNN-decoded features. In BinsFormer, by contrast, the authors use a multi-scale Transformer decoder to generate adaptive bins and to recover spatial geometry information from the encoded features.…”
Section: B. ViT-based MDE Methods (mentioning)
confidence: 99%
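As a concrete reading of the adaptive-bins idea attributed to BinsFormer above, the sketch below predicts depth as the expectation over learned bin centers: predicted bin widths are normalized with a softmax, converted to bin-center depths, and combined with per-pixel bin probabilities. The function name, shapes, and softmax formulation are assumptions for illustration, not BinsFormer's actual implementation.

import torch

def depth_from_bins(bin_logits, prob_logits, min_depth=0.1, max_depth=10.0):
    """bin_logits: (B, N) raw bin widths; prob_logits: (B, N, H, W)."""
    # Normalize bin widths over the depth range, then convert widths
    # to cumulative edges and finally to bin-center depths.
    widths = torch.softmax(bin_logits, dim=1) * (max_depth - min_depth)
    edges = min_depth + torch.cumsum(widths, dim=1)
    centers = edges - 0.5 * widths                      # (B, N)
    probs = torch.softmax(prob_logits, dim=1)           # (B, N, H, W)
    # Per-pixel depth as the probability-weighted sum of bin centers.
    return (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)

depth = depth_from_bins(torch.randn(2, 64), torch.randn(2, 64, 32, 32))
print(depth.shape)  # torch.Size([2, 1, 32, 32])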
“…For the decoder, a local planar guidance (LPG) layer was proposed, which establishes a direct and explicit relationship between the features extracted from the encoder and the final output. In 2022, a paper by Zhenyu Li et al. [12] sought to improve both the global and local features extracted from the encoder through the HAHI (hierarchical aggregation heterogeneous integration) module. The HAHI module consists of a self-attention module that enhances features from the hierarchical layers of the Swin Transformer and a cross-attention module for affinity modeling of features from the two heterogeneous encoder branches.…”
Section: Related Work (mentioning)
confidence: 99%
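A rough sketch of the HAHI-style interaction just described, under the assumption that both feature sets are flattened to token sequences: a self-attention step enhances the hierarchical Transformer features, then a cross-attention step models affinity between the two heterogeneous branches. Module names and shapes here are hypothetical, not the paper's code.

import torch
import torch.nn as nn

class HAHISketch(nn.Module):
    """Illustrative reading of a self-attention + cross-attention interaction."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, trans_tokens, conv_tokens):
        # Enhance the Transformer branch's hierarchical tokens.
        enhanced, _ = self.self_attn(trans_tokens, trans_tokens, trans_tokens)
        # Let the convolution branch attend to the enhanced tokens
        # (affinity modeling between heterogeneous features).
        fused, _ = self.cross_attn(conv_tokens, enhanced, enhanced)
        return fused

t = torch.randn(1, 100, 64)   # flattened Transformer features
c = torch.randn(1, 400, 64)   # flattened CNN features
print(HAHISketch()(t, c).shape)  # torch.Size([1, 400, 64])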
“…The authors in Ref. [19] use an encoder composed of a Transformer [20] branch and a CNN branch to fully capture long-range correlation and local information, but the multi-branch encoder leads to a complex model. The work in Ref.…”
Section: Related Work (mentioning)
confidence: 99%