2022
DOI: 10.1109/jsen.2022.3199265
Self-Supervised Monocular Depth Estimation Using Hybrid Transformer Encoder

Abstract: Depth estimation using monocular camera sensors is an important technique in computer vision. Supervised monocular depth estimation requires large amounts of data acquired from depth sensors. However, acquiring depth data is expensive, and sensor limitations sometimes make it impossible. View synthesis-based depth estimation is a self-supervised learning method that does not require depth supervision. Previous studies mainly use CNN-based networks in their encoders. CNNs are suitable for…

Cited by 18 publications (14 citation statements)
References 52 publications
“…3. Unlike Hwang et al [27], which utilizes residual blocks to improve local features, our objective in developing HFM is to comprehensively integrate local detailed features from the ResNet branch and global features from the Transformer branch using adaptive feature alignment. The module generates four fused features {F i } 4 i=1 with a channel number of 64, reducing model complexity, enhancing computational efficiency, and preventing overfitting.…”
Section: HFM Module
confidence: 99%
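The fusion described in the citation above — combining local features from a ResNet branch with global features from a Transformer branch, projected down to 64 channels — can be sketched with numpy. This is a hypothetical illustration, not the paper's implementation; the function name, feature shapes, and random projection weights are all assumptions.

```python
import numpy as np

def fuse_features(local_feat, global_feat, out_channels=64, seed=0):
    """Hypothetical sketch of hybrid feature fusion: concatenate CNN (local)
    and Transformer (global) feature maps along the channel axis, then
    project to a fixed channel count with a 1x1-convolution-style linear map."""
    rng = np.random.default_rng(seed)
    # Feature maps are (channels, height, width); concatenate on channels.
    concat = np.concatenate([local_feat, global_feat], axis=0)
    c_in = concat.shape[0]
    # A 1x1 convolution is a per-pixel linear projection over channels.
    w = rng.standard_normal((out_channels, c_in)) / np.sqrt(c_in)
    fused = np.einsum('oc,chw->ohw', w, concat)
    return fused

# Example shapes for one scale (hypothetical channel counts).
local = np.ones((96, 8, 8))    # e.g. ResNet-branch features
global_ = np.ones((48, 8, 8))  # e.g. Transformer-branch features
out = fuse_features(local, global_)
print(out.shape)  # (64, 8, 8)
```

Projecting every scale to the same small channel count, as the quoted passage notes, keeps the decoder lightweight regardless of how wide the two encoder branches are.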
“…However, the pure Transformer model lacks the ability to model local information due to the absence of spatial inductive bias. To achieve more satisfactory results, some methods have begun to combine Transformers with CNNs [13, 22, 26-28] to leverage the strengths of both approaches. This combination allows for better performance in MDE tasks [13, 22, 26], as illustrated in Fig.…”
Section: Introduction
confidence: 99%
“…The main idea behind self-supervised monocular depth prediction is to use view synthesis [3] to construct a photometric consistency loss as supervision. Typically, self-supervised monocular depth prediction methods [4-6] train two neural networks to estimate depth and pose, using a photometric and gradient-based loss, called the appearance loss. Since the appearance loss is fragile under illumination variations, many loss schemes, such as ICP loss [7] and scale-consistency geometric constraints [8-10], have been proposed to further improve the robustness and accuracy of self-supervised monocular depth prediction.…”
Section: Introduction
confidence: 99%
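The appearance loss mentioned in the citation above — a photometric term plus a gradient-based term comparing the target frame with the view synthesized from predicted depth and pose — can be sketched as follows. This is a minimal illustration under assumed weighting, not the loss from any cited paper; the function name and `alpha` weight are hypothetical.

```python
import numpy as np

def appearance_loss(target, synthesized, alpha=0.85):
    """Hypothetical sketch of an appearance loss: a weighted sum of a
    photometric (L1) term and an image-gradient term between the target
    frame and the view synthesized from predicted depth and pose."""
    # Photometric term: mean absolute intensity difference.
    photometric = np.mean(np.abs(target - synthesized))

    # Gradient term: compare horizontal/vertical image gradients, which
    # is less sensitive to global illumination shifts than raw intensity.
    def grads(img):
        gx = img[:, 1:] - img[:, :-1]
        gy = img[1:, :] - img[:-1, :]
        return gx, gy

    tgx, tgy = grads(target)
    sgx, sgy = grads(synthesized)
    gradient = np.mean(np.abs(tgx - sgx)) + np.mean(np.abs(tgy - sgy))
    return alpha * photometric + (1 - alpha) * gradient

img = np.linspace(0.0, 1.0, 64).reshape(8, 8)
print(appearance_loss(img, img))  # identical images -> 0.0
```

When the synthesized view matches the target exactly the loss is zero; any illumination-only offset contributes through the photometric term but not the gradient term, which motivates the more robust loss schemes (ICP, scale-consistency constraints) the citation lists.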
“…While this approach is effective at leveraging prior knowledge such as object shape and textures, it is limited in its ability to learn the geometry and the motion of the scene. By contrast, using multiple frames [1], [2], [16] as input has the potential to provide a more comprehensive view of the scene and to help the model better understand the relationships between objects and their motions.…”
Section: Introduction
confidence: 99%