Extracting buildings accurately from very high-resolution (VHR) remote sensing imagery is challenging due to diverse building appearances, spectral variability, and complex backgrounds. Recent studies mainly adopt variants of the Fully Convolutional Network (FCN) with an encoder-decoder architecture to extract buildings, which have shown promising improvements over conventional methods. However, FCN-based encoder-decoder models still fail to fully exploit the implicit characteristics of building shapes. This adversely affects the accurate localization of building boundaries, which is particularly important for building mapping. To extract buildings with more accurate boundaries, this paper proposes a contour-guided and local structure-aware encoder-decoder network (CGSANet). CGSANet is a multi-task network composed of a contour-guided (CG) module and a multi-region-guided (MRG) module. The CG module is supervised by building contour maps and thus learns contour-related spatial features that retain the shape pattern of buildings. The MRG module is deeply supervised by four building region maps to further capture the multiscale and contextual features of buildings. In addition, a hybrid loss function is designed to improve the structure-learning ability of CGSANet. These three improvements work synergistically to produce high-quality building extraction results. Experimental results on the WHU and NZ32km2 building datasets demonstrate that, compared with the tested algorithms, CGSANet produces more accurate building extraction results, achieving the best intersection-over-union (IoU) values of 91.55% and 90.02%, respectively. Experiments on the INRIA building dataset further demonstrate the generalization ability of the proposed framework, indicating its great practical potential.
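To make the multi-task setup concrete, the sketch below illustrates one way the training step described above could be wired up in PyTorch: a shared encoder feeding a contour head (CG branch) supervised by contour labels, four deeply supervised region heads (MRG branch) supervised by the building mask, and a hybrid loss combining pixel-wise BCE with a region-level soft-IoU term. The abstract does not specify the actual architecture, loss components, or weights, so every name and design choice here (`ToyCGSANet`, the BCE + IoU combination, equal loss weighting) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of contour-guided, deeply supervised multi-task training.
# All module/loss choices below are hypothetical stand-ins for CGSANet.
import torch
import torch.nn as nn
import torch.nn.functional as F


def soft_iou_loss(prob, target, eps=1e-6):
    """Soft IoU loss on sigmoid probabilities (a common structure-aware term)."""
    inter = (prob * target).sum(dim=(2, 3))
    union = (prob + target - prob * target).sum(dim=(2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()


def hybrid_loss(logits, target):
    """Hypothetical hybrid loss: pixel-wise BCE plus a region-level IoU term."""
    prob = torch.sigmoid(logits)
    return F.binary_cross_entropy_with_logits(logits, target) + soft_iou_loss(prob, target)


class ToyCGSANet(nn.Module):
    """Stand-in network: one contour head (CG) and four deeply supervised
    region heads (MRG) on top of a shared encoder."""

    def __init__(self, ch=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.contour_head = nn.Conv2d(ch, 1, 1)             # CG branch
        self.region_heads = nn.ModuleList(
            [nn.Conv2d(ch, 1, 1) for _ in range(4)]         # MRG branch
        )

    def forward(self, x):
        feat = self.encoder(x)
        contour = self.contour_head(feat)
        regions = [head(feat) for head in self.region_heads]
        return contour, regions


# One training step: the contour output is supervised by contour labels,
# each region output by the building mask (deep supervision).
net = ToyCGSANet()
img = torch.rand(2, 3, 64, 64)
mask = torch.randint(0, 2, (2, 1, 64, 64)).float()
contour_gt = torch.randint(0, 2, (2, 1, 64, 64)).float()

contour_out, region_outs = net(img)
loss = hybrid_loss(contour_out, contour_gt)
for r in region_outs:
    loss = loss + hybrid_loss(r, mask)
loss.backward()
```

The soft-IoU term penalizes region-level disagreement rather than per-pixel error alone, which is one plausible reading of how a hybrid loss could "improve the structure learning ability" of the network.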