PointNet does not incorporate local features, which limits its segmentation accuracy, while PointNet++ segments with low efficiency. To address these problems, we propose PillarPointNet, a two-branch feature extraction method built on PointNet. The network is divided into upper and lower parallel paths. The upper path is a simplified version of PointNet, used to extract global features. In the lower path, we use pillars instead of voxels, which reduces the computation wasted by voxelization when point density is uneven. The feature of each pillar is then assigned to the points within its grid cell to represent the neighborhood features of those points. Finally, the local features, the global features, and the raw points are concatenated and fed into the segmentation network. Experimental results show that, compared with several current state-of-the-art segmentation networks, PillarPointNet maintains a good balance between speed and accuracy.
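To make the two-branch design concrete, here is a minimal PyTorch sketch of how such a network could be wired. The class name, layer widths, and the precomputed `pillar_idx` assignment are illustrative assumptions on our part, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PillarPointNetSketch(nn.Module):
    """Illustrative two-branch feature extractor (not the authors' code).

    Upper branch: simplified PointNet -> one global feature per cloud.
    Lower branch: per-pillar max-pooled feature, scattered back to points.
    Per-point output: [raw point, local (pillar) feature, global feature].
    """
    def __init__(self, in_dim=3, local_dim=64, global_dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, global_dim), nn.ReLU(),
        )
        self.pillar_mlp = nn.Sequential(
            nn.Linear(in_dim, local_dim), nn.ReLU(),
        )

    def forward(self, points, pillar_idx):
        # points: (N, 3); pillar_idx: (N,) pillar id per point, precomputed
        # on a 2-D x/y grid (pillars are voxels unbounded in z).
        n = points.size(0)
        # Upper path: shared MLP + max pool over all points -> global feature.
        g = self.point_mlp(points).max(dim=0).values           # (global_dim,)
        global_feat = g.unsqueeze(0).expand(n, -1)             # (N, global_dim)
        # Lower path: max-pool point features within each pillar, then
        # scatter each pillar's feature back to its member points.
        f = self.pillar_mlp(points)                            # (N, local_dim)
        num_pillars = int(pillar_idx.max()) + 1
        pooled = f.new_full((num_pillars, f.size(1)), float("-inf"))
        pooled = pooled.scatter_reduce(
            0, pillar_idx.unsqueeze(1).expand_as(f), f, reduce="amax")
        local_feat = pooled[pillar_idx]                        # (N, local_dim)
        # Concatenate raw points, local and global features per point.
        return torch.cat([points, local_feat, global_feat], dim=1)
```

A pillar index can be built by quantizing each point's x/y coordinates onto a 2-D grid (e.g. `idx = gx * num_cells_y + gy`); since pillars span the full z range, no vertical binning is needed.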
Detecting 3D objects in crowds remains a challenging problem, since cars and pedestrians often cluster together and occlude one another in the real world. PointPillars is a leading 3D object detector: its detection pipeline is simple and its detection speed is fast. However, because the Voxel Feature Encoding (VFE) stage uses max pooling to extract global features, fine-grained features are lost, leaving the feature pyramid network (FPN) stage with insufficient feature expression ability, so small objects are not detected accurately enough. This paper proposes to improve detection in complex environments by integrating attention mechanisms into PointPillars. In the VFE stage of the model, a hybrid attention (HA) module is added to retain the spatial structure of the point cloud as fully as possible from three perspectives: local space, global space, and individual points. A Convolutional Block Attention Module (CBAM) is embedded in the FPN to mine the deep information of the pseudo-images. Experiments on the KITTI dataset demonstrate that our method outperforms other state-of-the-art single-stage algorithms. In crowd scenes, the mean average precision (mAP) under the bird's-eye view (BEV) detection benchmark increases from 59.20% for PointPillars and 66.19% for TANet to 69.91% for our method, the mAP under the 3D detection benchmark increases from 62% for TANet to 65.11% for ours, and the detection speed drops only slightly, from 13.1 fps for PointPillars to 12.8 fps for ours.
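Since the paper embeds a standard CBAM into the FPN, a minimal PyTorch sketch of that block (following Woo et al., 2018) may help. This is the generic CBAM formulation; the `reduction` and `spatial_kernel` hyperparameters are common defaults assumed here, not values taken from the paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Generic Convolutional Block Attention Module (Woo et al., 2018):
    channel attention followed by spatial attention. A sketch of the kind
    of block embedded in the FPN, not the authors' exact module."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):                     # x: (B, C, H, W) pseudo-image
        # --- channel attention ---
        avg = self.mlp(x.mean((2, 3), keepdim=True))
        mx = self.mlp(x.amax(2, keepdim=True).amax(3, keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # --- spatial attention ---
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

In this setting the block would be applied to the (B, C, H, W) pseudo-image feature maps between FPN stages; where exactly it is inserted is a design choice of the paper.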