Building extraction from aerial remote sensing imagery is an important and challenging task. Most existing methods rely on local-area attention and overlook the accuracy lost by neglecting the global structure of buildings. However, global structural features that are strongly coupled in complex scenes, such as building edges and bodies, are difficult to extract, which leads to discontinuous results. Therefore, the Multiscale Decoupled Body and Edge Supervision Network (MDBES-Net), which considers both edge optimization and inner consistency, is proposed to address these problems. MDBES-Net consists of the Body-Mask-Edge Consistency Constraint base network (BMECC), the Decoupling the Body and Edge Aware module (DBEA), and the Channel Decoupled Attention module (CDA). First, in the BMECC base network, Body-Mask-Edge consistency constraint supervision is established from body and edge labels to jointly improve segmentation. Second, in the multiscale DBEA module, building features are warped by a learnable flow field to make body parts more consistent and edges more detailed. Finally, the CDA module adaptively calibrates the channel responses of the re-coupled feature map to minimize interference from external background noise. Experiments on the open Massachusetts Building Dataset and the WHU Building Dataset show that the proposed MDBES-Net can accurately extract buildings in complex scenarios, producing complete building segmentation with refined boundaries and improved internal consistency.
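
The decouple, re-couple, and channel-calibration steps described above can be illustrated with a minimal NumPy sketch. It is not the MDBES-Net implementation: the function names, the nearest-neighbour warp, the random stand-in for the learned flow field, and the sigmoid channel gate are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def warp_nearest(feat, flow):
    """Warp a (C, H, W) feature map by a (2, H, W) flow field (dy, dx).

    Assumption: nearest-neighbour sampling, chosen only to keep the sketch short.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates, rounded and clamped to the image bounds.
    src_y = np.clip(np.rint(ys + flow[0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[1]).astype(int), 0, w - 1)
    return feat[:, src_y, src_x]

def decouple(feat, flow):
    """Split features into a warped 'body' part and a residual 'edge' part."""
    body = warp_nearest(feat, flow)
    edge = feat - body  # residual carries what the warp did not capture
    return body, edge

def channel_gate(feat, weight, bias):
    """Rescale each channel by a sigmoid gate computed from its global mean.

    Assumption: a generic channel-attention gate, not the paper's CDA module.
    """
    squeezed = feat.mean(axis=(1, 2))                      # shape (C,)
    gate = 1.0 / (1.0 + np.exp(-(weight @ squeezed + bias)))
    return feat * gate[:, None, None]

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))             # hypothetical feature map
flow = rng.normal(scale=0.5, size=(2, 8, 8))  # stand-in for a learned flow field

body, edge = decouple(feat, flow)
recoupled = body + edge                       # re-coupling by summation (assumed)
out = channel_gate(recoupled, rng.normal(size=(4, 4)), np.zeros(4))
```

In a trained network the flow field and gate weights would be learned, and the body and edge branches would each be supervised by their own labels before being re-coupled.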