Automatically extracting urban buildings from remote sensing images has essential application value, such as urban planning and management. Gaofen-7 (GF-7) provides multi-perspective and multispectral satellite images, which can obtain three-dimensional spatial information. Previous studies on building extraction often ignored information outside the redāgreenāblue (RGB) bands. To utilize the multi-dimensional spatial information of GF-7, we propose a dual-stream multi-scale network (DMU-Net) for urban building extraction. DMU-Net is based on U-Net, and the encoder is designed as the dual-stream CNN structure, which inputs RGB images, near-infrared (NIR), and normalized digital surface model (nDSM) fusion images, respectively. In addition, the improved FPN (IFPN) structure is integrated into the decoder. It enables DMU-Net to fuse different band features and multi-scale features of images effectively. This new method is tested with the study area within the Fourth Ring Road in Beijing, and the conclusions are as follows: (1) Our network achieves an overall accuracy (OA) of 96.16% and an intersection-over-union (IoU) of 84.49% for the GF-7 self-annotated building dataset, outperforms other state-of-the-art (SOTA) models. (2) Three-dimensional information significantly improved the accuracy of building extraction. Compared with RGB and RGB + NIR, the IoU increased by 7.61% and 3.19% after using nDSM data, respectively. (3) DMU-Net is superior to SMU-Net, DU-Net, and IEU-Net. The IoU is improved by 0.74%, 0.55%, and 1.65%, respectively, indicating the superiority of the dual-stream CNN structure and the IFPN structure.