There remains several challenges that are encountered in the task of extracting buildings from aerial imagery using convolutional neural networks (CNNs). First, the tremendous complexity of existing building extraction networks impedes their practical application. In addition, it is arduous for networks to sufficiently utilize the various building features in different images. To address these challenges, we propose an efficient network called MSL-Net that focuses on both multiscale building features and multilevel image features. First, we use depthwise separable convolution (DSC) to significantly reduce the network complexity, and then we embed a group normalization (GN) layer in the inverted residual structure to alleviate network performance degradation. Furthermore, we extract multiscale building features through an atrous spatial pyramid pooling (ASPP) module and apply long skip connections to establish long-distance dependence to fuse features at different levels of the given image. Finally, we add a deformable convolution network layer before the pixel classification step to enhance the feature extraction capability of MSL-Net for buildings with irregular shapes. The experimental results obtained on three publicly available datasets demonstrate that our proposed method achieves state-of-the-art accuracy with a faster inference speed than that of competing approaches. Specifically, the proposed MSL-Net achieves 90.4%, 81.1% and 70.9% intersection over union (IoU) values on the WHU Building Aerial Imagery dataset, Inria Aerial Image Labeling dataset and Massachusetts Buildings dataset, respectively, with an inference speed of 101.4 frames per second (FPS) for an input image of size 3 × 512 × 512 on an NVIDIA RTX 3090 GPU. With an excellent tradeoff between accuracy and speed, our proposed MSL-Net may hold great promise for use in building extraction tasks.