Extracting buildings from multispectral Light Detection and Ranging (LiDAR) data is important in domains such as urban planning, disaster response, and environmental monitoring. State-of-the-art deep learning models, including the Point Convolutional Neural Network (PointCNN) and the Mask Region-based Convolutional Neural Network (Mask R-CNN), have been applied effectively to this task, but their performance depends on the characteristics of the data and the application. This research compares PointCNN and Mask R-CNN for building extraction from multispectral LiDAR data, evaluating each model's accuracy, efficiency, and ability to handle irregularly spaced point clouds. PointCNN extracts buildings more accurately and efficiently than Mask R-CNN: by learning features directly from the point cloud, it avoids preprocessing steps such as voxelization, improving both accuracy and processing speed, and it copes naturally with the variable point spacing of LiDAR data. Mask R-CNN nevertheless outperforms PointCNN in some cases; because it operates on image-like data rather than raw point clouds, it is better at detecting and categorizing objects viewed from different angles. The study therefore emphasizes choosing the deep learning model appropriate to the application when extracting buildings from multispectral LiDAR data. The two approaches were compared using precision, recall, and F1 score. The PointCNN model outperformed Mask R-CNN, achieving 93.40% precision, 92.34% recall, and a 92.72% F1 score, whereas Mask R-CNN achieved only moderate precision, recall, and F1 values.
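The evaluation metrics above can be illustrated with a minimal sketch. The true-positive, false-positive, and false-negative counts below are hypothetical values chosen for illustration, not the study's actual confusion-matrix counts, and the F1 here is the per-class harmonic mean (the reported F1 may be averaged differently across classes).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # fraction of predicted buildings that are real
    recall = tp / (tp + fn)      # fraction of real buildings that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical building-detection counts, for illustration only
p, r, f1 = precision_recall_f1(tp=920, fp=65, fn=76)
print(f"precision={p:.2%}  recall={r:.2%}  F1={f1:.2%}")
```

Because F1 is the harmonic mean, it always lies between precision and recall and is pulled toward the lower of the two, which is why a model must score well on both metrics to report a high F1.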