Building extraction identifies building pixels in remote sensing imagery and plays a significant role in urban planning, dynamic urban monitoring, and many other applications. UNet3+ is widely applied to building extraction from remote sensing images. However, it still suffers from low segmentation accuracy, imprecise boundary delineation, and high model complexity. Therefore, based on the UNet3+ model, this paper proposes a 3D Joint Attention (3DJA) module that strengthens the correlation between local and global features, yielding more accurate object semantic information and a richer feature representation. The 3DJA module models semantic interdependence along the vertical and horizontal dimensions to obtain spatial encoding information for the feature maps, and along the channel dimension to increase the correlation between dependent channel maps. In addition, a bottleneck module is constructed to reduce the number of network parameters and improve training efficiency. Extensive experiments are conducted on the publicly accessible WHU, INRIA, and Massachusetts building datasets, and the BOMSC-Net, CVNet, SCA-Net, SPCL-Net, ACMFNet, and MFCF-Net models are selected as benchmarks for comparison with the proposed 3DJA-UNet3+ model. The experimental results show that 3DJA-UNet3+ achieves competitive results on three evaluation metrics: overall accuracy, mean intersection over union, and F1-score. The code will be available at https://github.com/EnjiLi/3DJA-UNet3Plus.
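For readers unfamiliar with the three evaluation metrics named above, the following is a minimal sketch of how overall accuracy, mean intersection over union, and F1-score are commonly computed for a binary building mask. It is an illustration rather than the evaluation code of this paper: the function name `binary_metrics`, the flat 0/1 label encoding (1 = building, 0 = background), and the choice to average the IoU over the two classes are all assumptions made here.

```python
def binary_metrics(pred, truth):
    """Return (OA, mIoU, F1) for flat sequences of 0/1 labels.

    Assumption: 1 marks a building pixel and 0 marks background.
    """
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1  # building predicted, building in ground truth
        elif p == 1 and t == 0:
            fp += 1  # building predicted, background in ground truth
        elif p == 0 and t == 1:
            fn += 1  # background predicted, building in ground truth
        else:
            tn += 1  # background predicted, background in ground truth

    total = tp + fp + fn + tn
    # Overall accuracy: fraction of correctly labelled pixels.
    oa = (tp + tn) / total if total else 0.0

    # Per-class IoU, then the mean over the two classes (assumed definition).
    iou_building = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    iou_background = tn / (tn + fp + fn) if (tn + fp + fn) else 0.0
    miou = (iou_building + iou_background) / 2

    # F1-score for the building class: harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return oa, miou, f1


if __name__ == "__main__":
    # Toy example with six pixels, purely illustrative.
    prediction = [1, 1, 0, 0, 1, 0]
    ground_truth = [1, 0, 0, 0, 1, 1]
    print(binary_metrics(prediction, ground_truth))
```

In practice the masks would be flattened 2D arrays and the counts accumulated over the whole test set before the ratios are taken, so that large and small images contribute in proportion to their pixel counts.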