Computer vision based on convolutional neural network (CNN) models has become a reliable and powerful tool for detecting potential damage in concrete structures at the pixel level. In this study, an advanced SWIN U-Net architecture was introduced to detect concrete cracks. The model integrates attention-based convolutional neural networks to significantly enhance the speed and accuracy of crack detection. The distinctive features of the SWIN Transformer allow the model to be applied to images of varying sizes while using computational resources efficiently. The model was trained on a dataset of crack images, each paired with a mask that highlights the relevant regions within the image. To counter potential overfitting, the training data were augmented using Flip, Rotate, Random Contrast, Random Gamma, Random Brightness, Elastic Transformation, Grid Distortion, and Optical Distortion. Additionally, L2 and spatial dropout regularization were applied to the proposed model. The model was optimized with the Adam optimizer, a stochastic gradient-based method, using the binary cross-entropy loss function and a learning rate of 0.001. It was trained for 100 epochs with a learning-rate adjustment scheduler. The performance of the model was evaluated using several metrics and compared with three benchmark models, with strong results. Notably, the Dice coefficient and IoU values were 93% and 79%, respectively. The trained model achieved a score of 0.99 in accuracy, precision, recall, F1, and sensitivity.
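
For readers unfamiliar with the overlap metrics named above, the following is a minimal sketch of how a Dice coefficient (one minus the Dice loss) and an IoU value can be computed for one predicted mask against its ground-truth mask. It is an illustration under stated assumptions, not the evaluation code used in this study: the function name `dice_and_iou`, the `eps` smoothing term, and the nested-list mask representation are all hypothetical choices made for the example.

```python
def dice_and_iou(pred_mask, true_mask, eps=1e-7):
    """Return (dice, iou) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration only; `eps` is an assumed smoothing
    term that avoids division by zero when both masks are empty.
    """
    # Flatten both masks into 1-D lists of booleans.
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must contain the same number of pixels")

    # Count pixels marked as crack in both masks, and in each mask alone.
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    pred_sum = sum(pred)
    true_sum = sum(true)
    union = pred_sum + true_sum - intersection

    dice = (2 * intersection + eps) / (pred_sum + true_sum + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou


# Toy 2x3 masks: 3 predicted crack pixels, 3 true crack pixels, 2 shared.
predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]

dice, iou = dice_and_iou(predicted, ground_truth)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

For a single image pair the two metrics are tied by Dice = 2·IoU / (1 + IoU), so the toy masks give an IoU of 0.5 and a Dice coefficient of about 0.667. Values averaged over many images need not satisfy this identity exactly.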