Road crack detection is crucial for inspecting and maintaining civil infrastructure, as cracks pose a risk to road safety. Traditional methods for pavement crack detection are labour-intensive and time-consuming. In recent years, computer vision approaches have shown encouraging results in automating crack localization. However, classical convolutional neural network (CNN)-based approaches lack global attention over spatial features. To improve road crack localization, we designed an encoder-decoder network that combines a vision transformer (ViT) with CNNs. In addition, we designed a gated-attention module in the decoder that focuses on the upsampling process. Furthermore, we proposed a hybrid loss function that combines binary cross-entropy and Dice loss to evaluate the model’s effectiveness. On the Crack500 dataset, our method achieved a recall, F1-score, and IoU of 98.54%, 98.07%, and 98.72%, respectively; on the Crack dataset, the corresponding figures were 98.27%, 98.69%, and 98.76%. On the proposed dataset, they were 96.89%, 97.20%, and 97.36%.
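
As an illustration of the hybrid loss mentioned above, the following is a minimal sketch of one common way to combine binary cross-entropy with a soft Dice loss over flattened per-pixel crack probabilities. The abstract does not state how the two terms are weighted or smoothed, so the function names, the `bce_weight` of 0.5, and the `smooth` and `eps` constants are all assumptions made for this example and are not taken from the paper.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(P) + sum(T) + smooth)."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)


def hybrid_loss(probs, targets, bce_weight=0.5):
    """Weighted sum of BCE and Dice loss (the weighting is an assumption)."""
    return (bce_weight * bce_loss(probs, targets)
            + (1.0 - bce_weight) * dice_loss(probs, targets))


if __name__ == "__main__":
    mask = [1, 0, 1, 0]                # ground-truth crack pixels
    good = [0.9, 0.1, 0.8, 0.2]        # prediction close to the mask
    bad = [0.2, 0.8, 0.1, 0.9]         # prediction far from the mask
    print(hybrid_loss(good, mask) < hybrid_loss(bad, mask))
```

The cross-entropy term scores each pixel independently, while the Dice term measures region overlap, which makes it less sensitive to the strong imbalance between crack and background pixels that is typical of pavement images.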