Image-based detection is the primary non-destructive testing (NDT) method for welding defects. Currently, the best performance in image-based detection is achieved by transformer models. Despite their high accuracy, however, these models have notable limitations: large parameter counts, large training-data requirements, and high computational cost. They are also weaker at capturing local features than global features. In this study, an improved and optimized welding defect detection and identification framework, named Fast Multi-Path Vision transformer (FMPVit), is proposed based on the transformer model. The model uses a multilayer parallel architecture and strengthens its ability to capture local information through advanced multiscale convolution feature aggregation and the addition of a new local convolution module. Finally, the model is validated on an open dataset of weld seams, where it shows a clear performance improvement over mainstream baseline models.