2022
DOI: 10.48550/arxiv.2201.06357
Preprint

Disentangled Latent Transformer for Interpretable Monocular Height Estimation

Abstract: Monocular height estimation (MHE) from remote sensing imagery has high potential in generating 3D city models efficiently for a quick response to natural disasters. Most existing works pursue higher performance. However, there is little research exploring the interpretability of MHE networks. In this paper, we aim to explore how deep neural networks predict height from a single monocular image. Towards a comprehensive understanding of MHE networks, we propose to interpret them from multiple levels: 1) Neu…

Citations: cited by 3 publications (3 citation statements)
References: 24 publications
“…You et al first determined the depth selectivity of some hidden units, which showed good insights for interpreting monocular depth estimation models [18]. Zhi et al proposed a novel disentangled latent Transformer model based on the multi-level interpretation [19]. Santo et al proposed a deep photometric stereo network (DPSN) by applying a photometric stereo method based on deep learning away from simplified image formation models such as the general Lambert model [20].…”
Section: Pixel-wise Dense Prediction
Type: mentioning
confidence: 99%
“…2) Multi-task Learning Networks: Multi-task learning networks introduce auxiliary tasks in addition to height predictions, with the expectation that both tasks support each other during training. Usually, based on the assumption that heights and semantics are highly correlated [33], semantic segmentation can be regarded as an auxiliary task for height estimation. For example, Srivastava et al [34] were the first to showcase the gains the auxiliary semantic segmentation head brings.…”
Section: A Monocular Height Estimation
Type: mentioning
confidence: 99%
“…With the advent of deep learning, various multi-modal learning networks [12,13] are proposed for both the computer vision and remote sensing communities [14,15]. VQA for natural images has been developed for many years [16,17], and VQA for remote sensing images has also been a hot research topic in recent years [18,19].…”
Section: Introduction
Type: mentioning
confidence: 99%