3D semantic segmentation of point clouds aims to assign a semantic label to each point while utilizing and respecting the 3D representation of the data. Detailed 3D semantic segmentation of urban areas can assist policymakers, insurance companies, and governmental agencies in applications such as urban growth assessment, disaster management, and traffic supervision. The recent proliferation of remote sensing techniques has led to the production of high-resolution multimodal geospatial data. However, few techniques are currently available to fuse such multimodal data effectively. Therefore, this paper proposes a novel deep learning-based, end-to-end Point-wise LiDAR and Image Multimodal Fusion Network (PMNet) for 3D semantic segmentation of aerial point clouds by fusing aerial image features. PMNet respects the basic characteristics of point clouds, namely their unordered and irregular format, and is invariant to point permutation. Notably, PMNet can also be trained on multi-view 3D scanned data, since it treats the aerial point cloud as a fully 3D representation. The proposed method was applied to two datasets: (1) an urban area of Osaka, Japan, and (2) the University of Houston campus, USA, and its neighborhood. Quantitative and qualitative evaluation shows that PMNet outperforms models that use non-fusion and other multimodal fusion strategies (observational-level fusion and feature-level fusion). In addition, the paper demonstrates further improved performance of PMNet when the medium and minor classes are over-sampled/augmented to address class-imbalance issues.
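Since the abstract does not specify architectural details, the following is only a minimal sketch of the point-wise fusion idea it describes: a PointNet-style shared per-point MLP combined, per point, with image features sampled at the point's projected location. All module names, layer sizes, and the precomputed `uv` projection input are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code) of point-wise LiDAR-image fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointwiseFusionSeg(nn.Module):
    def __init__(self, num_classes: int, img_channels: int = 3):
        super().__init__()
        # Per-point branch: shared 1x1 convolutions process each point
        # independently, so the result is permutation-invariant per point.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        # Image branch: a small CNN producing a dense feature map.
        self.img_cnn = nn.Sequential(
            nn.Conv2d(img_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Per-point classifier over the concatenated point + image features.
        self.head = nn.Sequential(
            nn.Conv1d(128 + 64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, num_classes, 1),
        )

    def forward(self, points, image, uv):
        # points: (B, N, 3) xyz; image: (B, C, H, W);
        # uv: (B, N, 2) image-plane coordinates of each point in [-1, 1],
        # assumed precomputed from the sensor/georeferencing geometry.
        point_feat = self.point_mlp(points.transpose(1, 2))       # (B, 128, N)
        img_feat_map = self.img_cnn(image)                        # (B, 64, H, W)
        # Bilinearly sample an image feature for every projected point.
        grid = uv.unsqueeze(2)                                    # (B, N, 1, 2)
        img_feat = F.grid_sample(img_feat_map, grid,
                                 align_corners=True).squeeze(-1)  # (B, 64, N)
        fused = torch.cat([point_feat, img_feat], dim=1)          # (B, 192, N)
        return self.head(fused)                                   # (B, classes, N)
```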