Building extraction from high-resolution aerial images is critical in geospatial applications such as telecommunications, dynamic urban monitoring, geographic database updating, urban planning, disaster monitoring, and navigation. Automatic building extraction is a challenging task because buildings in different places vary in their spectral and geometric properties. As a result, traditional image processing approaches are insufficient for automatic building extraction from high-resolution aerial imagery. In recent years, semantic segmentation with deep learning models has become increasingly important for automatic object extraction from high-resolution images. In this study, the U-Net model, initially designed for biomedical image analysis, was used for building extraction. The encoder part of the U-Net model was enhanced with ResNet50, VGG19, VGG16, DenseNet169, and Xception backbones. In addition, three other models, PSPNet, FPN, and LinkNet, were implemented for comparison with the U-Net model. Performance analysis using the intersection over union (IoU) metric showed that U-Net with the VGG16 encoder gives the best results among the compared models, with an IoU score of 83.06%. This research aims to examine the effectiveness of these four approaches (U-Net, PSPNet, FPN, and LinkNet) for extracting buildings from high-resolution aerial data.
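
For readers unfamiliar with the IoU score reported above, the following is a minimal sketch of how it can be computed for a single pair of binary building masks. It assumes that masks are arrays in which nonzero pixels mark buildings. The function name `iou_score`, the convention for two empty masks, and the toy arrays are illustrative assumptions and are not taken from the study's implementation.

```python
import numpy as np


def iou_score(pred_mask, true_mask):
    """Intersection over union of two binary masks (hypothetical helper)."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)

    # Pixels labeled as building in both masks.
    intersection = np.logical_and(pred, true).sum()
    # Pixels labeled as building in at least one mask.
    union = np.logical_or(pred, true).sum()

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0
    return float(intersection) / float(union)


# Toy 2x3 masks: 2 overlapping building pixels, 4 building pixels in the union.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
true = np.array([[1, 0, 0],
                 [0, 1, 1]])
print(iou_score(pred, true))  # 0.5
```

A dataset-level score such as the one reported here is typically an aggregate of this quantity over many test images. How the study aggregates it is not stated in this abstract.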