In remotely sensed images, high intra-class variance and inter-class similarity are ubiquitous due to complex scenes and objects with multivariate features, making semantic segmentation a challenging task. Deep convolutional neural networks can address this problem by modeling the context of features and improving their discriminability. However, current learning paradigms model feature affinity in the spatial and channel dimensions separately and then fuse the results sequentially or in parallel, leading to suboptimal performance. In this study, we first analyze this problem empirically and summarize it as attention bias: when affinity is modeled only in the spatial or channel domain, the network's ability to distinguish weak, discretely distributed objects from wide-range objects with internal connectivity is reduced. To jointly model spatial and channel affinity, we design a synergistic attention module (SAM), which enables channel-wise affinity extraction while preserving spatial details. In addition, we propose a synergistic attention perception neural network (SAPNet) for the semantic segmentation of remote sensing images. The hierarchically embedded synergistic attention perception module aggregates SAM-refined features and decoded features. As a result, SAPNet enriches inference clues with the desired spatial and channel details. Experiments on three benchmark datasets show that SAPNet is competitive in accuracy and adaptability with state-of-the-art methods. The experiments also validate the hypothesis of attention bias and the efficiency of SAM.
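To make the idea of jointly modeling spatial and channel affinity concrete, the following is a minimal PyTorch sketch of one possible joint gating scheme. It is an illustrative assumption only, not the paper's actual SAM: the module name, kernel sizes, and reduction ratio are hypothetical, and the point is simply that the channel and spatial affinities are combined before a single gating step rather than applied sequentially.

```python
# Hypothetical sketch of joint spatial-channel attention; NOT the paper's SAM.
import torch
import torch.nn as nn


class JointSpatialChannelAttention(nn.Module):
    """Toy module that reweights features with a channel descriptor and a
    spatial map in a single (joint) gate rather than sequentially."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel branch: squeeze spatial dims, excite channels (SE-style).
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # Spatial branch: collapse channels, keep the full-resolution map.
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, 1, 1) channel affinity and (B, 1, H, W) spatial affinity.
        channel_logits = self.channel_mlp(x)
        spatial_logits = self.spatial_conv(x)
        # Joint gate: broadcast-sum the two logits, then one sigmoid, so the
        # spatial detail modulates every channel weight (and vice versa).
        gate = torch.sigmoid(channel_logits + spatial_logits)
        return x * gate


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)    # dummy encoder feature map
    refined = JointSpatialChannelAttention(64)(feats)
    print(refined.shape)                  # torch.Size([2, 64, 32, 32])
```

In this sketch the two affinity maps are fused by broadcasting before the sigmoid, so spatial detail is preserved in the final gate; a sequential design would instead apply the channel gate first and the spatial gate afterward, which is the fusion pattern the abstract identifies as a source of attention bias.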