2022
DOI: 10.1109/access.2022.3170425

SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings

Abstract: Monocular depth estimation (MDE) is the task of estimating depth from a single frame. This information is essential to many computer vision tasks, such as scene understanding and visual odometry, which are key components of autonomous and robotic systems. Approaches based on state-of-the-art vision transformer architectures are extremely deep and complex, which makes them unsuitable for real-time inference on edge and autonomous systems equipped with low resources (i.e. robot indoor navigation an…

Cited by 11 publications (16 citation statements)
References 87 publications
“…We can observe this fact in Figure 3 (third column), where missing depth measurements are represented as yellow pixels. As can be seen from the reported results, despite the high estimation error achieved by all the methods with respect to the terrestrial dataset, the proposed MobileNetV3 with only of training parameters outperforms both [12] and [5] while obtaining the same RMSE and a higher to [11]. In addition, the proposed model achieves a boost on the inference frequency equal to with respect to [11] on the same benchmark hardware.…”
Section: Results (mentioning)
confidence: 78%
“…We designed those models in compliance with two essential concepts: the speed to maximize the inference frequency on embedded devices, and the robustness to maximize the estimation accuracy across two datasets [14, 26]. Our solution exploits an encoder–decoder model, similar to previous related works, such as [11, 12]. In more detail, we perform an in-depth study on lightweight encoders pre-trained on ImageNet [27] to improve their generalization capabilities.…”
Section: Proposed Methods (mentioning)
confidence: 99%
“…It involves the use of pixel shape and orientation for the identification of the distance of objects within 2D images and video from the device that recorded it. Its utility is mainly in photography and depth estimation for self-driving vehicles, while within our sources, it was mostly used for personal projects such as in [88]. The performance of these applications is covered in Table 4 as well as a comparison graph being provided in Figure 4.…”
Section: Application Based System Comparison (mentioning)
confidence: 99%