Surface water plays a crucial role in climate change, human production, and daily life, so accurate monitoring of surface water is particularly important. However, the spatial distribution of surface water is highly diverse and complex, which makes accurate mapping challenging. When extracting water bodies from medium-resolution satellite remote sensing images, convolutional neural network (CNN) methods can be limited by their receptive fields and by insufficient context modeling, leading to the loss of water body boundary details and poor fusion of multiscale features. Relatively little research has addressed this issue, so new combinations of deep learning networks need to be explored. In this study, we propose a combination of deep learning networks that fully utilizes multiscale information to enhance water features. Specifically, we combine deformable convolutions with the Swin Transformer to enlarge the effective receptive field while better integrating global semantic information. This combination can capture features of water bodies at different scales, improve the accuracy and completeness of water extraction, and provide reliable technical support for detailed water body extraction. We tested the model on Sentinel-2 satellite images. All four evaluation metrics exceeded 90%, with an average accuracy of 97.89%, average precision of 94.98%, average recall of 90.05%, and an average F1 score of 92.33%. In mountainous areas, the model achieved an accuracy of 98.03%. These results validate the potential of combining the Swin Transformer and deformable convolutions for detailed water body extraction.
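The abstract reports accuracy, precision, recall, and F1 score without defining them. The following is a minimal sketch, assuming the standard pixel-wise confusion-matrix definitions for binary water masks (1 = water, 0 = non-water). The function name `water_mask_metrics` and the toy masks are hypothetical and are not taken from the study.

```python
def water_mask_metrics(pred, truth):
    """Pixel-wise metrics for two binary masks given as lists of rows.

    Assumption: 1 marks a water pixel and 0 a non-water pixel.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of rows")

    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        if len(pred_row) != len(truth_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1  # water correctly detected
            elif p == 1 and t == 0:
                fp += 1  # non-water labeled as water
            elif p == 0 and t == 1:
                fn += 1  # water missed
            else:
                tn += 1  # non-water correctly rejected

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy 2x4 masks, invented for illustration only.
predicted = [[1, 1, 0, 0],
             [1, 0, 0, 0]]
reference = [[1, 0, 0, 0],
             [1, 1, 0, 0]]
metrics = water_mask_metrics(predicted, reference)
print(metrics)
```

If the reported averages are taken per image, which is an assumption here, the average F1 score need not equal the harmonic mean of the average precision and the average recall, because F1 is a nonlinear function of the two.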