2022
DOI: 10.3390/s22010337
DTS-Net: Depth-to-Space Networks for Fast and Accurate Semantic Object Segmentation

Abstract: We propose Depth-to-Space Net (DTS-Net), an effective technique for semantic segmentation using the efficient sub-pixel convolutional neural network. This technique is inspired by depth-to-space (DTS) image reconstruction, which was originally used for image and video super-resolution tasks, combined with a mask enhancement filtration technique based on multi-label classification, namely, Nearest Label Filtration. In the proposed technique, we employ depth-wise separable convolution-based architectures. We pro…

Cited by 9 publications (8 citation statements)
References 44 publications

Citation statements:
“…DW-Conv is much faster than standard convolution because it learns fewer parameters, which is key to the fast processing in our proposed method. Xception has also proved to be a good feature extractor in recent research on multiple computer vision tasks, and it is light enough for real-time applications thanks to its relatively low FLOP and parameter counts [29, 30]; it is also compatible with the pixel-shuffle [11] operation (also employed in our proposed method and introduced in Section 3.2), as Xception with pixel-shuffle achieved high accuracy on the semantic segmentation task in DTS-Net [25]. As our method performs semantic segmentation as a secondary task to predict the encoded line, we adopted a modified version of Xception for its robustness and high accuracy.…”
Section: Proposed Methods (mentioning)
confidence: 99%
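As a point of reference for the depth-wise separable convolution (DW-Conv) mentioned in the statement above, the following is a minimal PyTorch sketch of the operation as popularized by Xception: a per-channel depthwise convolution followed by a 1x1 pointwise convolution. The layer and tensor sizes are illustrative assumptions, not values taken from DTS-Net or the citing paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a 3x3 conv applied per channel
    (groups=in_channels) followed by a 1x1 pointwise conv that mixes channels."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Illustrative sizes: a standard 3x3 conv from 32 to 64 channels learns
# 32*64*9 = 18432 weights, while the separable version learns only
# 32*9 + 32*64 = 2336, which is where the speed/parameter saving comes from.
x = torch.randn(1, 32, 128, 128)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 128, 128])
```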
“…This algorithm can up-scale many (r²) low-resolution images of shape (H × W), where r is the scaling factor, into a high-resolution image of shape (rH × rW) through pixel shuffling from the depth channel. This algorithm is fast and efficient in constructing higher-resolution images, and especially segmentation masks, as explored in detail in our previous research [25, 26]. The progressive probabilistic Hough transform (PPHT) [12] is a popular method for straight-line detection that uses a small set of edge points instead of all edge points used in the standard Hough transform (SHT) [27]; thus, PPHT is much faster than SHT.…”
Section: Related Work (mentioning)
confidence: 99%
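To make the depth-to-space (pixel-shuffle) rearrangement described in this statement concrete, here is a minimal PyTorch sketch; the scaling factor, channel count, and spatial size are illustrative assumptions rather than settings reported in the paper.

```python
import torch
import torch.nn as nn

r = 4            # assumed scaling factor
channels = 21    # assumed number of output maps (e.g. one per class)

# A low-resolution tensor carrying r*r sub-pixel images per output map
# in its depth dimension: shape (N, channels * r*r, H, W).
low_res = torch.randn(1, channels * r * r, 60, 80)

# Depth-to-space rearranges those depth channels into spatial positions,
# producing shape (N, channels, r*H, r*W) with no learned parameters.
high_res = nn.PixelShuffle(r)(low_res)
print(high_res.shape)  # torch.Size([1, 21, 240, 320])
```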
“…Lee et al. [13] proposed a CNN-based method, From Big to Small (BTS), which utilizes local planar guidance layers at different scales in the decoder stage to guide the feature maps toward accurate depth predictions. We also provided competitive depth estimation results in previous research [14, 15], in which we eliminated the complexity of the decoder in the encoder-decoder CNN architecture using depth-to-space (pixel-shuffle) image reconstruction. Although the previously stated methods attained relatively good results, the estimated depth in most of them is blurry, especially at object borders in the scene, owing to inefficient encoding and decoding stages caused by the local learning scheme inherent in the convolution operation.…”
Section: Related Work (mentioning)
confidence: 99%
“…Depth estimation is a critical task in a variety of computer vision applications, including 3D scene reconstruction from 2D images, medical 3D imaging, augmented reality, self-driving cars and robots, and 3D computer graphics and animation. Recent advances in depth estimation research have shown the effectiveness of convolutional neural networks (CNNs) for this task [1–15]. Encoder-decoder CNN architectures are the most widely used for dense prediction tasks [2–12], i.e., image-like predictions such as semantic segmentation and depth estimation.…”
Section: Introduction (mentioning)
confidence: 99%