2017
DOI: 10.48550/arxiv.1709.00179
Preprint

Effective Use of Dilated Convolutions for Segmenting Small Object Instances in Remote Sensing Imagery



Cited by 8 publications (25 citation statements); References 15 publications.
“…Key point classification can benefit from larger receptive-field kernels, which provide better context-aware feature extraction. This can be achieved efficiently using dilated convolutions [21], which increase the receptive field of filters without increasing the number of parameters. Average pooling is better suited to preserving overall context features than max pooling, which picks only the most dominant feature in the receptive field.…”
Section: B Network Architecturementioning
confidence: 99%
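The core claim above, a larger field of view at no extra parameter cost, can be made concrete with a small sketch (pure Python; the helper name is ours, not from the cited work): a k-tap kernel at dilation rate r spans k + (k − 1)(r − 1) input positions while its weight count stays fixed.

```python
def effective_kernel_size(k: int, r: int) -> int:
    """Spatial extent covered by a k-wide kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

# A 3x3 kernel always has 9 trainable weights, regardless of dilation
# rate, but its field of view per axis grows with r:
for r in (1, 2, 4):
    print(r, effective_kernel_size(3, r))  # 1 -> 3, 2 -> 5, 4 -> 9
```

This is why dilation is attractive for context aggregation: stacking a few dilated layers covers a wide window without the parameter growth of genuinely larger kernels.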
See 1 more Smart Citation
“…Key point classification can benefit from larger receptive field kernels which can provide better context aware feature extraction. This can be achieved efficiently using dilated convolutions [21] which increase receptive field of filters without increasing the number of parameters. Average pooling is better suited to preserve the overall context features compared to max pooling which picks the most dominant feature in the receptive field.…”
Section: B Network Architecturementioning
confidence: 99%
“…To further improve the receptive fields of the filters without increasing the parameter count, we use dilated convolutions throughout the downsampling blocks. Hamaguchi et al [21] showed that dilated convolutions are useful for feature extraction on small and crowded objects. This property suits our problem of object detection in BEV images, where objects are relatively small.…”
Section: B Network Architecturementioning
confidence: 99%
“…where RF_i, K_i, R_i, and Stride_i are the receptive field size, kernel size, dilation rate, and stride of the i-th layer, respectively. Inspired by [41] and other novel networks [20], [43], the dilation rate is usually set to a sequence; we deployed dilated convolutions in stage2, as demonstrated in Table I, with the dilation-rate sequence 1, 2, 1, 4, 1, 8, 1, 16. Using (7) and (8), we can calculate that the receptive field of our ESFNet is 599; if we remove the dilated convolutions in stage2, the receptive field is only 183, which is not enough to cover the whole image.…”
Section: A Our Core Modulementioning
confidence: 99%
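Equations (7) and (8) referenced in the quote are not reproduced in the excerpt; a standard recurrence consistent with the description (receptive field grows by the dilated kernel extent times the accumulated stride) can be sketched as follows. The function name and the toy layer stack are illustrative, not taken from ESFNet.

```python
def receptive_field(layers):
    """Receptive field of a conv stack, input to output.

    layers: list of (kernel, dilation, stride) tuples.
    Uses the common recurrence RF_i = RF_{i-1} + (K_i - 1) * R_i * J_{i-1},
    where J_{i-1} is the product of the strides of all preceding layers.
    """
    rf, jump = 1, 1
    for k, r, s in layers:
        rf += (k - 1) * r * jump
        jump *= s
    return rf

# Toy stack of 3-wide convs with dilation rates 1, 2, 4, 8 and stride 1:
print(receptive_field([(3, 1, 1), (3, 2, 1), (3, 4, 1), (3, 8, 1)]))  # -> 31
```

Under this recurrence, exponentially increasing dilation rates give exponential receptive-field growth per layer, which is why sequences like 1, 2, 1, 4, 1, 8, 1, 16 reach a field of several hundred pixels with few layers.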
“…Dilated convolutions, also known as atrous convolutions, have been widely explored in deep convolutional neural networks (DCNNs) for various tasks, including semantic image segmentation [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], object detection [11], [12], [13], [14], audio generation [15], video modeling [16], and machine translation [17]. The idea of dilated filters was developed in the à trous algorithm for efficient wavelet decomposition in [18] and has been used in image pixel-wise prediction tasks to allow efficient computation [1], [2], [11], [12].…”
Section: Introductionmentioning
confidence: 99%
“…Dilation upsamples convolutional filters by inserting zeros between weights, as illustrated in Figure 1. It enlarges the receptive field, or field of view [5], [6], [8], but does not require training extra parameters in DCNNs. Dilated convolutions can be used in cascade to build multi-layer networks [15], [16], [17].…”
Section: Introductionmentioning
confidence: 99%