Building footprint extraction is an essential process for various geospatial applications. City management is entrusted with eliminating slums, which are increasing in rural areas. Several recent studies have shown that creating footprints in dense areas is challenging and that such footprints are in limited supply. Compared with more traditional methods, deep learning algorithms significantly improve the accuracy of automated building footprint extraction from remote sensing data. The Mask R-CNN object detection framework, which is used to extract buildings in dense areas, sometimes fails to provide adequate building boundaries because of urban edge intersections and unstructured buildings. We therefore introduced a modified workflow that trains an ensemble of Mask R-CNN models with two backbones, ResNet-34 and ResNet-101. The results of the two models were then stacked to refine the structure of the building boundaries. The proposed workflow, which includes data preprocessing and deep learning instance segmentation, was applied to a light detection and ranging (LiDAR) point cloud in a dense rural area. The proposed method produced better-regularized polygons and achieved an overall accuracy of 94.63%.
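As a rough illustration of the idea of stacking the outputs of two models, the sketch below averages two per-pixel building probability maps and thresholds the mean, then scores the result against a reference mask. The abstract does not specify how the stacking is performed, so the averaging rule, the 0.5 threshold, the function names `stack_masks` and `overall_accuracy`, and the toy values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[int]]


def stack_masks(prob_a: Grid, prob_b: Grid, threshold: float = 0.5) -> Mask:
    """Combine two per-pixel building probability maps by averaging.

    Assumption: simple mean followed by a threshold. The actual stacking
    rule used in the paper is not stated in the abstract.
    """
    same_rows = len(prob_a) == len(prob_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(prob_a, prob_b)):
        raise ValueError("probability maps must have the same shape")
    return [
        [1 if (a + b) / 2.0 >= threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(prob_a, prob_b)
    ]


def overall_accuracy(pred: Mask, truth: Mask) -> float:
    """Fraction of pixels whose predicted label matches the reference label."""
    total = sum(len(row) for row in truth)
    correct = sum(
        p == t
        for row_p, row_t in zip(pred, truth)
        for p, t in zip(row_p, row_t)
    )
    return correct / total


# Hypothetical 2x2 probability maps standing in for the outputs of the
# ResNet-34 and ResNet-101 backbone models (toy values, not real data).
prob_r34 = [[0.9, 0.4], [0.2, 0.7]]
prob_r101 = [[0.8, 0.7], [0.1, 0.2]]
reference = [[1, 1], [0, 1]]

stacked = stack_masks(prob_r34, prob_r101)
accuracy = overall_accuracy(stacked, reference)
```

In this toy case the two models disagree on two pixels, and the averaged map decides each one according to which model is more confident. Other combination rules, such as a union of the binary masks or a learned meta-classifier, would fit the same interface.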