Dams are essential structures in hydraulic engineering, and surface cracks threaten their integrity, impermeability, and durability. Automated crack detection based on computer vision offers substantial advantages over manual inspection in efficiency, objectivity, and precision. However, current methods suffer from misidentification, discontinuous crack predictions, and loss of detail when applied to real-world dam crack images, which often exhibit low contrast, complex backgrounds, and diverse crack morphologies. To address these challenges, this paper presents DCST-net, a pure Vision Transformer (ViT)-based dam crack segmentation network. DCST-net uses an improved Swin Transformer (SwinT) block as its fundamental building block to strengthen long-range dependencies within a SegNet-like encoder–decoder structure. In addition, a weighted attention block performs side fusion between the symmetric encoder and decoder at each stage to sharpen crack edges. To evaluate the proposed method, six semantic segmentation models were trained and tested on a self-built dam crack dataset and two publicly available datasets. The comparison indicates that the proposed model outperforms mainstream methods in visual quality and on most evaluation metrics, highlighting its potential for practical application in dam safety inspection and maintenance.
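As an illustration of how a comparison on evaluation metrics can be computed for binary crack masks, the following minimal sketch calculates intersection over union (IoU) and the F1 score from pixel-wise counts. The abstract does not name the metrics used, so the choice of IoU and F1 is an assumption, and the function names and toy masks are hypothetical.

```python
# Minimal sketch: pixel-wise IoU and F1 for binary crack masks.
# Assumption: IoU and F1 are used here only as common segmentation
# metrics; the abstract does not state which metrics were reported.

def confusion_counts(pred, target):
    """Return (tp, fp, fn) for two same-shaped binary masks (lists of rows)."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # crack pixel correctly predicted
            elif p and not t:
                fp += 1      # background predicted as crack
            elif t and not p:
                fn += 1      # crack pixel missed
    return tp, fp, fn


def iou(pred, target):
    """Intersection over union of the crack class."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = tp + fp + fn
    # Convention chosen here: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def f1_score(pred, target):
    """Harmonic mean of precision and recall for the crack class."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


# Hypothetical 3x4 masks: 1 marks a crack pixel, 0 marks background.
target = [[0, 1, 1, 0],
          [0, 0, 1, 0],
          [0, 0, 1, 1]]
pred = [[0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 1]]

print(confusion_counts(pred, target))
print(round(iou(pred, target), 4), round(f1_score(pred, target), 4))
```

Because cracks typically occupy a small fraction of the image, overlap-based scores of this kind are usually more informative than plain pixel accuracy, which is dominated by background pixels.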