In satellite remote sensing images, the existence of clouds has an occlusion effect on ground information. Different degrees of clouds make it difficult for existing models to accurately detect clouds in images due to complex scenes. The detection and extraction of clouds is one of the most important problems to be solved in the further analysis and utilization of image information. In this article, we refined a multi-head soft attention convolutional neural network incorporating spatial information modeling (MSACN). During the encoder process, MSACN extracts cloud features through a concurrent dilated residual convolution module. In the part of the decoder, there is an aggregating feature module that uses a soft attention mechanism. It integrates the semantic information with spatial information to obtain the pixel-level semantic segmentation outputs. To assess the applicability of MSACN, we compare our network with Transform-based and other traditional CNN-based methods on the ZY-3 dataset. Experimental outputs including the other two datasets show that MSACN has a better overall performance for cloud extraction tasks, with an overall accuracy of 98.57%, a precision of 97.61%, a recall of 97.37%, and F1-score of 97.48% and an IOU of 95.10%.