2020
DOI: 10.48550/arxiv.2011.06961
Preprint

Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis

Abstract: Analyzing scenes thoroughly is crucial for mobile robots acting in different environments. Semantic segmentation can enhance various subsequent tasks, such as (semantically assisted) person perception, (semantic) free space detection, (semantic) mapping, and (semantic) navigation. In this paper, we propose an efficient and robust RGB-D segmentation approach that can be optimized to a high degree using NVIDIA TensorRT and, thus, is well suited as a common initial processing step in a complex system for scene an…
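The abstract's point about TensorRT suitability rests on the usual deployment path: export the trained network to an exchange format, then build an optimized engine from it. A minimal sketch, assuming a PyTorch model and ONNX as the intermediate format (the placeholder network and file names are illustrative, not from the paper):

```python
import torch

# Placeholder network; in practice this would be the trained segmentation model.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1)).eval()
dummy = torch.randn(1, 3, 480, 640)

# Export to ONNX, the usual hand-off format for TensorRT.
torch.onnx.export(model, dummy, "model.onnx", opset_version=11)

# Then build an optimized engine with TensorRT's trtexec tool, e.g.:
#   trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```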

Cited by 7 publications (14 citation statements) | References 37 publications (65 reference statements)
“…Multimodal Learning Deep learning makes fusing different signals easier, which enables us to develop many multimodal frameworks. For example, [26,15,13,20,12] combine RGB and depth images to improve semantic segmentation; [7,11] fuse audio with video for scene understanding; researchers also explore audio-visual source separation and localization [36,10]. In the semi-supervised setting, [27] proposed a novel method, Total Correlation Gain Maximization (TCGM), based on information theory, which explores the information intersection by maximizing the total correlation gain among all the modalities.…”
Section: Related Work
confidence: 99%
“…Implementation details. Our segmentation UMT is based on a state-of-the-art method, i.e., ESANet [41]. For the RGB modality, the ResNet34 backbone, downsampling method, and contextual module are employed following [31,49].…”
Section: RGB-Depth Semantic Segmentation Experiments
confidence: 99%
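For context on what "based on ESANet with a ResNet34 backbone" implies architecturally, here is a minimal sketch of the two-branch RGB-D encoder pattern, not ESANet's actual code: ESANet additionally modifies the ResNet blocks and fuses with learned attention weights, whereas this sketch fuses by plain element-wise addition. The class name `TwoBranchRGBDEncoder` is illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class TwoBranchRGBDEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb = resnet34()
        self.depth = resnet34()
        # Depth is a single channel; replace the first conv accordingly.
        self.depth.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)

    def forward(self, rgb, depth):
        # Stem of both branches.
        r = self.rgb.maxpool(self.rgb.relu(self.rgb.bn1(self.rgb.conv1(rgb))))
        d = self.depth.maxpool(self.depth.relu(self.depth.bn1(self.depth.conv1(depth))))
        # Fuse depth into the RGB branch after every residual stage
        # (plain addition here; ESANet uses attention-based fusion instead).
        for stage in ("layer1", "layer2", "layer3", "layer4"):
            r = getattr(self.rgb, stage)(r)
            d = getattr(self.depth, stage)(d)
            r = r + d
        return r  # 1/32-resolution features for a segmentation decoder

feats = TwoBranchRGBDEncoder()(torch.randn(1, 3, 480, 640),
                               torch.randn(1, 1, 480, 640))
print(feats.shape)  # torch.Size([1, 512, 15, 20])
```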
“…Much current research on multi-modal fusion mainly revolves around the design of model architectures, such as middle fusion [41,45], late fusion [44] and attention-based fusion [14,41]. However, simply combining multiple modalities often results in unsatisfactory performance.…”
Section: Introduction
confidence: 99%
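As a reference for the terminology in the statement above, a minimal sketch contrasting middle fusion (combining intermediate features under a shared head) with late fusion (averaging the predictions of two complete per-modality models); all module names are illustrative:

```python
import torch
import torch.nn as nn

class MiddleFusion(nn.Module):
    """Fuse intermediate features from two encoders, then decode jointly."""
    def __init__(self, enc_a, enc_b, head):
        super().__init__()
        self.enc_a, self.enc_b, self.head = enc_a, enc_b, head

    def forward(self, a, b):
        fused = torch.cat([self.enc_a(a), self.enc_b(b)], dim=1)
        return self.head(fused)

class LateFusion(nn.Module):
    """Run two full per-modality models, then average their predictions."""
    def __init__(self, model_a, model_b):
        super().__init__()
        self.model_a, self.model_b = model_a, model_b

    def forward(self, a, b):
        return 0.5 * (self.model_a(a) + self.model_b(b))
```

In practice `enc_a`/`enc_b` would be modality-specific backbones and `head` a segmentation decoder; attention-based fusion replaces the plain concatenation or averaging with learned per-location weights.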
“…The data augmentation strategies we adopted include random cropping, rotation, and color jittering. We use ESANet [28], an efficient ResNet-based encoder, as our backbone. We use the common 40-class label setting and mean IoU (mIoU) as the evaluation metric.…”
Section: NYUv2 Semantic Segmentation
confidence: 99%
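The mIoU metric named in the statement is standard: per class, the intersection over union of predicted and ground-truth masks, averaged over classes. A minimal sketch under the common 40-class NYUv2 setting (the function name and ignore-index convention are assumptions, not from the cited paper):

```python
import numpy as np

def mean_iou(pred, target, num_classes=40, ignore_index=255):
    mask = target != ignore_index
    # Confusion matrix: rows are ground truth, columns are predictions.
    cm = np.bincount(num_classes * target[mask] + pred[mask],
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    denom = cm.sum(axis=0) + cm.sum(axis=1) - tp  # TP + FP + FN per class
    iou = tp / np.maximum(denom, 1)
    return iou[denom > 0].mean()  # average over classes that actually occur

pred = np.random.randint(0, 40, size=(480, 640))
target = np.random.randint(0, 40, size=(480, 640))
print(mean_iou(pred.ravel(), target.ravel()))
```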
“…In addition, we include an InfoNCE [24] baseline where we directly contrast multimodal input tuples without tuple disturbing and sample optimization. We also include supervised pretraining [28] methods for completeness.…”
Section: NYUv2 Semantic Segmentation
confidence: 99%
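For reference, the InfoNCE objective mentioned in this baseline contrasts paired embeddings in-batch: each sample's counterpart from the other modality is the positive, all other batch entries act as negatives. A minimal sketch (the temperature value is a common default, not taken from the cited paper):

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature  # (N, N) similarity matrix
    labels = torch.arange(z_a.size(0))    # true pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```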