Because buildings vary in shape, form, texture, and colour, automatically extracting them from high-resolution aerial photographs remains difficult. The Mask Region-based Convolutional Neural Network (Mask R-CNN) has recently shown improvements in object detection and extraction for data updating that are superior to other methods. In this paper, a dataset of aerial photographs acquired by aircraft over the urban and educational area of Institut Teknologi Sepuluh Nopember, Surabaya, is used to explore the potential of Mask R-CNN, a state-of-the-art model for instance segmentation, to automatically detect building footprints. Building footprints are essential attributes that define the urban fabric, and detecting them is critical to accelerating land cover updates that are highly accurate in terms of area and spatial assessment. The objective of this study was to apply artificial intelligence, specifically the Mask R-CNN method, to building footprint detection. To this end, the aerial imagery was clipped into image chips that served as training data for the model. The model achieved a precision of 73%. The loss curve recorded during training suggests that the model fits the data well. Further study could focus on improving the precision of the model, which would in turn improve the quality of the detected footprints.
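The two quantitative steps named in the abstract, clipping imagery into chips and reporting precision, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `chip_windows` and `precision`, the 512-pixel chip size, and the non-overlapping stride are all assumptions made for illustration, since the abstract does not state them.

```python
from typing import Iterator, Tuple

# A window is (col_off, row_off, width, height) in pixel coordinates.
Window = Tuple[int, int, int, int]


def chip_windows(width: int, height: int,
                 chip: int = 512, stride: int = 512) -> Iterator[Window]:
    """Yield pixel windows that tile a width x height image into chips.

    The chip size and stride are hypothetical defaults, not values from
    the study. Edge chips are truncated so no window leaves the image.
    """
    if chip <= 0 or stride <= 0:
        raise ValueError("chip and stride must be positive")
    for row_off in range(0, height, stride):
        for col_off in range(0, width, stride):
            yield (col_off, row_off,
                   min(chip, width - col_off),
                   min(chip, height - row_off))


def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP); returns 0.0 when nothing was detected."""
    detected = true_positives + false_positives
    return true_positives / detected if detected else 0.0
```

Each window could then be passed to a raster reader to cut the corresponding chip from the aerial image. As a purely illustrative calculation with made-up counts, 73 correct detections out of 100 predicted footprints would give a precision of 0.73.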