Automated methods to extract buildings from very high resolution (VHR) remote sensing data have many applications in a wide range of fields. Many convolutional neural network (CNN) based methods have been proposed and have achieved significant advances in the building extraction task. In order to refine predictions, a lot of recent approaches fuse features from earlier layers of CNNs to introduce abundant spatial information, which is known as skip connection. However, this strategy of reusing earlier features directly without processing could reduce the performance of the network. To address this problem, we propose a novel fully convolutional network (FCN) that adopts attention based re-weighting to extract buildings from aerial imagery. Specifically, we consider the semantic gap between features from different stages and leverage the attention mechanism to bridge the gap prior to the fusion of features. The inferred attention weights along spatial and channel-wise dimensions make the low level feature maps adaptive to high level feature maps in a target-oriented manner. Experimental results on three publicly available aerial imagery datasets show that the proposed model (RFA-UNet) achieves comparable and improved performance compared to other state-of-the-art models for building extraction.to detect large objects [14,15]. Since Long et al. [16] adapted the classification network into fully convolutional network (FCN) for semantic segmentation, FCN and its extensions have gradually become the preferred solution in the field of semantic labeling [17][18][19][20]. Though FCN-based methods can produce dense pixel-wise output directly, the pixel-wise classification derived from the final score map is quite coarse because of the sequential sub-sampling operations in the FCN.To address the problem of coarse predictions, recent research [21][22][23][24][25][26] have further improved FCN-based methods for semantic labeling of remote sensing images. There is a growing body of literature that many studies [27][28][29][30][31] employ the encoder-decoder architecture with skip connection. UNet [32], a typical model in the style of encoder-decoder, reuses low-level information to refine the output, and results in better performance. For obtaining accurate labeling of VHR images, an effective structure to integrate the high-resolution, low-level features, and the low-resolution, high-level features is needed. The skip connection fuses features so as to compensate the loss of spatial information caused by repeating local operations (e.g., pooling and strided convolution). Features via skip connection are multi-scale in nature due to the increasingly large receptive field sizes [33]. However, one thing to note is that most existing approaches that are built on top of a contemporary classification network are good at aggregating global contexts. While the reuse of information from early encoding layers contributes to localization in the decoding phase, it may introduce redundant information which results in over-segmentation [3...